We develop minimax optimal risk bounds for the general learning
task consisting in predicting as well as the best function in a reference
set $\mathcal{G}$ up to the smallest possible additive term, called the convergence
rate. When the reference set is finite and when $n$ denotes the size of
the training data, we provide minimax convergence rates of the form
$C\left(\frac{\log|\mathcal{G}|}{n}\right)^{v}$ with tight evaluation of the positive constant $C$ and with
exact $0 < v \le 1$, the latter value depending on the convexity of the
loss function and on the level of noise in the output distribution.

The risk upper bounds are based on a sequential randomized al- 
gorithm, which at each step concentrates on functions having both 
low risk and low variance with respect to the previous step prediction 
function. Our analysis puts forward the links between the probabilistic
and worst-case viewpoints, and allows us to obtain risk bounds
unachievable with the standard statistical learning approach. One of the
key ideas of this work is to use probabilistic inequalities with respect 
to appropriate (Gibbs) distributions on the prediction function space 
instead of using them with respect to the distribution generating the 
data. 

The risk lower bounds are based on refinements of the Assouad
lemma taking particularly into account the properties of the loss function.
Our key example to illustrate the upper and lower bounds is to
consider the $L_q$-regression setting for which an exhaustive analysis of
the convergence rates is given while $q$ ranges in $[1; +\infty[$.



1. Introduction. We are given a family $\mathcal{G}$ of functions and we want to
learn from data a function that predicts as well as the best function in $\mathcal{G}$ up
to some additive term called the convergence rate. Even when the set $\mathcal{G}$ is
finite, this learning task is crucial since:

• any continuous set of prediction functions can be viewed through its covering
nets with respect to (w.r.t.) appropriate (pseudo-)distances and these
nets are generally finite; 

• one way of doing model selection among a finite family of submodels is 
to cut the training set into two parts, use the first part to learn the best 
prediction function of each submodel and use the second part to learn a 
prediction function which performs as well as the best of the prediction 
functions learned on the first part of the training set. 

From this last item, our learning task for finite $\mathcal{G}$ is often referred to
as model selection aggregation. It has two well-known variants. Instead of
looking for a function predicting as well as the best in $\mathcal{G}$, these variants want
to perform as well as the best convex combination of functions in $\mathcal{G}$ or as well
as the best linear combination of functions in $\mathcal{G}$. These three aggregation
tasks are linked in several ways (see [45] and references within).

Nevertheless, among these learning tasks, model selection aggregation has
rare properties. First, in general an algorithm picking functions in the set $\mathcal{G}$
is not optimal (see, e.g., [9], Theorem 2, [40], Theorem 3, [21], page 14).

This means that the estimator has to look at an enlarged set of prediction 
functions. Second, in the statistical community, the only known optimal 
algorithms are all based on a Cesaro mean of Bayesian estimators (also 
referred to as progressive mixture rule). Third, the proof of their optimality 
is not achieved by the most prominent tool in statistical learning theory: 
bounds on the supremum of empirical processes (see [48], and refined works 
as [13, 17, 37, 42] and references within). 

The idea of the proof, which goes back to Barron [11], is based on a
chain rule and has proved successful for least squares and entropy losses
[12, 19, 20, 21, 53] and for general losses in [34].

In the online prediction with expert advice setting, without any prob- 
abilistic assumption on the generation of the data, appropriate weight- 
ing methods have been shown to behave as well as the best expert up 
to a minimax-optimal additive remainder term (see [26, 43] and references 
within). In this worst-case context, amazingly sharp constants have been 
found (see in particular [24, 25, 33, 54]). These results are expressed in cu- 
mulative loss and can be transposed to model selection aggregation to the 
extent that the expected risk of the randomized procedure based on sequen- 
tial predictions is proportional to the expectation of the cumulative loss of 
the sequential procedure (see Lemma 4.3 for precise statement). 

This work presents a sequential algorithm, which iteratively updates a 
prior distribution put on the set of prediction functions. Contrary to pre- 
viously mentioned works, these updates take into account the variance of 
the task. As a consequence, posterior distributions concentrate on simul- 
taneously low risk functions and functions close to the previously drawn 
prediction function. This conservative law is not surprising in view of previous
works on high-dimensional statistical tasks, such as wavelet thresholding,
shrinkage procedures, iterative compression schemes [5] and iterative feature 
selection [1]. 

The paper is organized as follows. Section 2 introduces the notation and 
the existing algorithms. Section 3 proposes a unifying setting to combine 
worst-case analysis tight results and probabilistic tools. It details our se- 
quentially randomized estimator and gives a sharp expected risk bound. In 
Sections 4 and 5, we show how to apply our main result under assumptions 
coming respectively from sequential prediction and model selection aggrega- 
tion. While all this work concentrates on stating results when the data are
independent and identically distributed, Section 4.2 shows that the argument
underlying the main theorem can be applied for sequential predictions
in which no probabilistic assumption is made and in which the data points 
come one by one (i.e., not in a batch manner). Section 6 contains algorithms 
that satisfy sharp standard-style generalization error bounds. To the au- 
thor's knowledge, these bounds are not achievable with a classical statistical 
learning approach based on supremum of empirical processes. Here the main 
trick is to use probabilistic inequalities w.r.t. appropriate distributions on 
the prediction function space instead of using them w.r.t. the distribution 
generating the data. Section 7 presents an improved bound for $L_q$-regression
($q > 1$) when the noise has just a bounded moment of order $s > q$. This
last assumption is much weaker than the traditional exponential moment 
assumption. Section 8 refines Assouad's lemma in order to obtain sharp 
constants and to take into account the properties of the loss function of the 
learning task. We illustrate our results by providing lower bounds match- 
ing the upper bounds obtained in the previous sections and by improving 
significantly the constants in lower bounds concerning Vapnik-Cervonenkis 
classes in classification. Section 9 summarizes the contributions of this work 
and lists some related open problems. 

2. Notation and existing algorithms. We assume that we observe $n$ pairs $Z_1 = (X_1, Y_1), \dots, Z_n = (X_n, Y_n)$ of input-output and that each pair has been independently drawn from the same unknown distribution denoted $P$. The input and output space are denoted respectively $\mathcal{X}$ and $\mathcal{Y}$, so that $P$ is a probability distribution on the product space $\mathcal{Z} \triangleq \mathcal{X} \times \mathcal{Y}$. The target of a learning algorithm is to predict the output $Y$ associated with an input $X$ for pairs $(X, Y)$ drawn from the distribution $P$. In this work, $Z_{n+1}$ will denote a random variable independent of the training set $Z_1^n \triangleq (Z_1, \dots, Z_n)$ and with the same distribution $P$. The quality of a prediction function $g : \mathcal{X} \to \mathcal{Y}$ is measured by the risk (also called expected loss or regret):

$$R(g) \triangleq \mathbb{E}_{Z \sim P} L(Z, g),$$
where $L(Z, g)$ assesses the loss of considering the prediction function $g$ on
the data $Z \in \mathcal{Z}$. The symbol $\triangleq$ is used to underline that the equality is a
definition. When there is no ambiguity on the distribution that a random
variable has, the expectation w.r.t. this distribution will simply be written
by indexing the expectation sign $\mathbb{E}$ by the random variable. For instance, we
can write $R(g) \triangleq \mathbb{E}_Z L(Z, g)$. More generally, when there are multiple sources
of randomness, $\mathbb{E}_Z$ means that we take the expectation with respect to the
conditional distribution of $Z$ knowing all other sources of randomness.

We use $L(Z, g)$ rather than $L[Y, g(X)]$ to underline that our results are
not restricted to nonregularized losses, where we call nonregularized loss a
loss that can be written as $\ell[Y, g(X)]$ for some function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$.

For any $i \in \{0, \dots, n\}$, the cumulative loss suffered by the prediction function $g$ on the first $i$ pairs of input-output, denoted $Z_1^i$ for short, is

$$\Sigma_i(g) \triangleq \sum_{j=1}^{i} L(Z_j, g),$$

where by convention we take $\Sigma_0$ identically equal to zero. The symbol $\equiv$ is used to underline when a function is identical to a constant (e.g., $\Sigma_0 \equiv 0$). With slight abuse, a symbol denoting a constant function may be used to denote the value of this function.

We assume that the set, denoted $\bar{\mathcal{G}}$, of all prediction functions has been equipped with a $\sigma$-algebra. Let $\mathcal{D}$ be the set of all probability distributions on $\bar{\mathcal{G}}$. By definition, a randomized algorithm produces a prediction function drawn according to a probability in $\mathcal{D}$. Let $\mathcal{P}$ be a set of probability distributions on $\mathcal{Z}$ in which we assume that the true unknown distribution generating the data lies. The learning task is essentially described by the 3-tuple $(\mathcal{G}, L, \mathcal{P})$ since we look for a possibly randomized estimator (or algorithm) $\hat g$ such that

$$\sup_{P \in \mathcal{P}} \Big\{ \mathbb{E}_{Z_1^n} R(\hat g_{Z_1^n}) - \min_{g \in \mathcal{G}} R(g) \Big\}$$

is minimized, where we recall that $R(g) \triangleq \mathbb{E}_{Z \sim P} L(Z, g)$. To shorten notation, when no confusion can arise, the dependence of $\hat g_{Z_1^n}$ w.r.t. the training sample will be dropped and we will simply write $\hat g$. This means that we use the same symbol for both the algorithm and the prediction function produced by the algorithm on a training sample.

We implicitly assume that the quantities we manipulate are measurable;
in particular, we assume that a prediction function is a measurable function
from $\mathcal{X}$ to $\mathcal{Y}$, the mapping $(x, y, g) \mapsto L[(x, y), g]$ is measurable, the
estimators considered in our lower bounds are measurable, and so on.

The $n$-fold product of a distribution $\mu$, which is the distribution of a
vector consisting of $n$ i.i.d. realizations of $\mu$, is denoted $\mu^{\otimes n}$. For instance,
the distribution of $(Z_1, \dots, Z_n)$ is $P^{\otimes n}$.
The symbol $C$ will denote some positive constant whose value may differ
from line to line. The set of nonnegative real numbers is denoted $\mathbb{R}_+ =
[0; +\infty[$. We define $\lfloor x \rfloor$ as the largest integer $k$ such that $k \le x$. To shorten
notation, any finite sequence $a_1, \dots, a_n$ will occasionally be denoted $a_1^n$. For
instance, the training set is $Z_1^n$.

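To handle possibly continuous set $\mathcal{G}$, we consider that $\mathcal{G}$ is a measurable space and that we have some prior distribution $\pi$ on it. The set of probability distributions on $\mathcal{G}$ will be denoted $\mathcal{M}$. The Kullback-Leibler divergence between a distribution $\rho \in \mathcal{M}$ and the prior distribution $\pi$ is

$$K(\rho, \pi) \triangleq \begin{cases} \mathbb{E}_{g \sim \rho} \log\Big(\frac{d\rho}{d\pi}(g)\Big), & \text{if } \rho \ll \pi, \\ +\infty, & \text{otherwise}, \end{cases}$$

where $\frac{d\rho}{d\pi}$ denotes the density of $\rho$ w.r.t. $\pi$ when it exists (i.e., $\rho \ll \pi$). For any $\rho \in \mathcal{M}$, we have $K(\rho, \pi) \ge 0$ and when $\pi$ is the uniform distribution on a finite set $\mathcal{G}$, we also have $K(\rho, \pi) \le \log|\mathcal{G}|$. The Kullback-Leibler divergence satisfies the duality formula (see, e.g., [22], page 160): for any real-valued measurable function $h$ defined on $\mathcal{G}$,

(2.1) $$\inf_{\rho \in \mathcal{M}} \{ \mathbb{E}_{g \sim \rho} h(g) + K(\rho, \pi) \} = -\log \mathbb{E}_{g \sim \pi} e^{-h(g)},$$

and the infimum is reached for the Gibbs distribution

$$\pi_{-h}(dg) \triangleq \frac{e^{-h(g)}}{\mathbb{E}_{g' \sim \pi} e^{-h(g')}} \cdot \pi(dg).$$

Intuitively, the Gibbs distribution $\pi_{-h}$ concentrates on prediction functions $g$ that are close to minimizing the function $h : \mathcal{G} \to \mathbb{R}$.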
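
As a numerical sanity check of the duality formula (2.1), here is a minimal Python sketch, assuming a finite set of four functions with a hypothetical prior and hypothetical values $h(g)$. The helper names `gibbs`, `kl` and `objective` are illustrative and do not come from the paper. The sketch evaluates the regularized objective at the Gibbs distribution and at the prior, and compares both with the closed form on the right-hand side of (2.1).

```python
import math

def gibbs(prior, h):
    """Gibbs distribution pi_{-h}: weights proportional to prior(g) * exp(-h(g))."""
    w = [p * math.exp(-x) for p, x in zip(prior, h)]
    total = sum(w)
    return [x / total for x in w]

def kl(rho, prior):
    """Kullback-Leibler divergence K(rho, prior) on a finite set."""
    return sum(r * math.log(r / p) for r, p in zip(rho, prior) if r > 0)

def objective(rho, prior, h):
    """E_{g~rho} h(g) + K(rho, prior), the quantity minimized in (2.1)."""
    return sum(r * x for r, x in zip(rho, h)) + kl(rho, prior)

# Hypothetical values, chosen only for illustration.
prior = [0.1, 0.2, 0.3, 0.4]
h = [1.5, 0.2, 0.7, 2.0]

closed_form = -math.log(sum(p * math.exp(-x) for p, x in zip(prior, h)))
at_gibbs = objective(gibbs(prior, h), prior, h)   # should match closed_form
at_prior = objective(prior, prior, h)             # should not be smaller
```

The value at the Gibbs distribution coincides with the closed form up to floating-point error, while any other distribution (here the prior itself) gives a value that is at least as large.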

For any $\rho \in \mathcal{M}$, $\mathbb{E}_{g \sim \rho}\, g : x \mapsto \mathbb{E}_{g \sim \rho}\, g(x) = \int g(x) \rho(dg)$ is called a mixture of
prediction functions. When $\mathcal{G}$ is finite, a mixture is simply a convex combination.
Throughout this work, whenever we consider mixtures of prediction
functions, we implicitly assume that $\mathbb{E}_{g \sim \rho}\, g(x)$ belongs to $\mathcal{Y}$ for any $x$ so
that the mixture is a prediction function. This is typically the case when $\mathcal{Y}$
is an interval of $\mathbb{R}$.

We will say that the loss function is convex when the function $g \mapsto L(z, g)$
is convex for any $z \in \mathcal{Z}$, equivalently $L(z, \mathbb{E}_{g \sim \rho}\, g) \le \mathbb{E}_{g \sim \rho} L(z, g)$ for any $\rho \in
\mathcal{M}$ and $z \in \mathcal{Z}$. In this work, we do not assume the loss function to be convex
except when it is explicitly mentioned.

The algorithm used to prove optimal convergence rates for several differ- 
ent losses (see, e.g., [12, 16, 19, 20, 21, 34, 53]) is the following: 

Algorithm A. Let $\lambda > 0$. Predict according to $\frac{1}{n+1} \sum_{i=0}^{n} \mathbb{E}_{g \sim \pi_{-\lambda \Sigma_i}}\, g$,
where we recall that $\Sigma_i$ maps a function $g \in \mathcal{G}$ to its cumulative loss up to
time $i$.
In other words, for a new input $x$, the prediction of the output given by
Algorithm A is $\frac{1}{n+1} \sum_{i=0}^{n} \frac{\int g(x) e^{-\lambda \Sigma_i(g)} \pi(dg)}{\int e^{-\lambda \Sigma_i(g)} \pi(dg)}$. Algorithm A
has also been used with the classification loss. For this nonconvex loss, it
has the same properties as the empirical risk minimizer on $\mathcal{G}$ [38, 39]. To
give the optimal convergence rate, the parameter $\lambda$ and the distribution $\pi$
should be appropriately chosen. When $\mathcal{G}$ is finite, the estimator belongs to
the convex hull of the set $\mathcal{G}$.
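
For a finite reference set, Algorithm A can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `progressive_mixture`, the nonregularized loss signature `loss(y, prediction)` and the toy data are all hypothetical, and the prior defaults to the uniform distribution.

```python
import math

def progressive_mixture(functions, data, loss, lam, prior=None):
    """Sketch of Algorithm A for a finite list of prediction functions.

    functions: list of callables g(x); data: list of (x, y) pairs;
    loss: callable loss(y, prediction); lam: the parameter lambda > 0.
    Returns the mixture predictor and its weights on the functions.
    """
    m = len(functions)
    prior = prior or [1.0 / m] * m
    cum = [0.0] * m           # Sigma_i(g), starting from Sigma_0 = 0
    posteriors = []           # Gibbs distributions pi_{-lambda Sigma_i}, i = 0..n
    for i in range(len(data) + 1):
        shift = min(cum)      # subtracting the minimum only avoids underflow
        w = [p * math.exp(-lam * (c - shift)) for p, c in zip(prior, cum)]
        total = sum(w)
        posteriors.append([x / total for x in w])
        if i < len(data):
            x_i, y_i = data[i]
            cum = [c + loss(y_i, g(x_i)) for c, g in zip(cum, functions)]
    # Cesaro mean of the n + 1 Gibbs posteriors gives the final weights.
    weights = [sum(p[k] for p in posteriors) / len(posteriors) for k in range(m)]
    return (lambda x: sum(wk * g(x) for wk, g in zip(weights, functions))), weights

# Toy usage with the square loss; the second function fits the data exactly.
functions = [lambda x: 0.0, lambda x: x, lambda x: 1.0]
data = [(0.1, 0.1), (0.5, 0.5), (0.9, 0.9)]
predict, weights = progressive_mixture(
    functions, data, lambda y, p: (y - p) ** 2, lam=2.0
)
```

Because the final weights are an average of probability vectors, the output is a convex combination of the functions in the reference set, in line with the remark above.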

From Vovk, Haussler, Kivinen and Warmuth works [33, 51, 52] and the 
link between cumulative loss in online setting and expected risk in the batch 
setting (see Lemma 4.3), an "optimal" algorithm is: 

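Algorithm B. Let $\lambda > 0$. For any $i \in \{0, \dots, n\}$, let $\hat h_i$ be a prediction
function such that

$$\forall z \in \mathcal{Z} \qquad L(z, \hat h_i) \le -\frac{1}{\lambda} \log \mathbb{E}_{g \sim \pi_{-\lambda \Sigma_i}} e^{-\lambda L(z, g)}.$$

If one of the $\hat h_i$ does not exist, the algorithm is said to fail. Otherwise it
predicts according to $\frac{1}{n+1} \sum_{i=0}^{n} \hat h_i$.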

In particular, for appropriate $\lambda > 0$, this algorithm does not fail when
the loss function is the square loss (i.e., $L(z, g) = [y - g(x)]^2$) and when the
output space is bounded. Algorithm B is based on the same Gibbs distribution
$\pi_{-\lambda \Sigma_i}$ as Algorithm A. Besides, in [33], Example 3.13, it is shown that
Algorithm A is not in general a particular case of Algorithm B, and that
Algorithm B will not generally produce a prediction function in the convex
hull of $\mathcal{G}$, unlike Algorithm A. In Sections 4 and 5, we will see how both
algorithms are connected to the SeqRand algorithm presented in the next
section.



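3. The algorithm and its generalization error bound. The aim of this section is to build an algorithm with the best possible minimax convergence rate. The algorithm relies on the following central condition for which we recall that $\mathcal{G}$ is a subset of the set $\bar{\mathcal{G}}$ of all prediction functions and that $\mathcal{M}$ and $\mathcal{D}$ are the sets of all probability distributions on respectively $\mathcal{G}$ and $\bar{\mathcal{G}}$.

For any $\lambda > 0$, let $\delta_\lambda$ be a real-valued function defined on $\mathcal{Z} \times \mathcal{G} \times \bar{\mathcal{G}}$ that satisfies the following inequality, which will be referred to as the variance inequality:

$$\forall \rho \in \mathcal{M} \quad \exists \hat\pi(\rho) \in \mathcal{D} \qquad \sup_{P \in \mathcal{P}} \Big\{ \mathbb{E}_{Z \sim P} \mathbb{E}_{g' \sim \hat\pi(\rho)} \log \mathbb{E}_{g \sim \rho}\, e^{\lambda [L(Z, g') - L(Z, g) - \delta_\lambda(Z, g, g')]} \Big\} \le 0.$$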

Input: $\lambda > 0$ and $\pi$ a distribution on the set $\mathcal{G}$.

1. Define $\hat\rho_0 \triangleq \hat\pi(\pi)$ in the sense of the variance inequality and draw a function $\hat g_0$ according to this distribution. Let $S_0(g) = 0$ for any $g \in \mathcal{G}$.

2. For any $i \in \{1, \dots, n\}$, iteratively define

(3.1) $$S_i(g) \triangleq S_{i-1}(g) + L(Z_i, g) + \delta_\lambda(Z_i, g, \hat g_{i-1}) \quad \text{for any } g \in \mathcal{G}$$

and

$$\hat\rho_i \triangleq \hat\pi(\pi_{-\lambda S_i}) \quad \text{in the sense of the variance inequality}$$

and draw a function $\hat g_i$ according to the distribution $\hat\rho_i$.

3. Predict with a function drawn according to the uniform distribution on the finite set $\{\hat g_0, \dots, \hat g_n\}$.

Conditionally to the training set, the distribution of the output prediction function will be denoted $\hat\mu$.

Fig. 1. The SeqRand algorithm.

The variance inequality is our probabilistic version of the generic algorithm condition in the online prediction setting (see [51], proof of Theorem 1, or more explicitly in [33], page 11), in which we added the variance function $\delta_\lambda$. Our results will be all the sharper as this variance function is small. To make the variance inequality more readable, let us say for the moment that:

• Without any assumption on $\mathcal{P}$, for several usual "strongly" convex loss
functions, we may take $\delta_\lambda \equiv 0$ provided that $\lambda$ is a small enough constant
(see Section 4).

• The variance inequality can be seen as a "small expectation" inequality.
The usual viewpoint is to control the quantity $L(Z, g)$ by its expectation
w.r.t. $Z$ and a variance term. Here, roughly, $L(Z, g)$ is mainly controlled
by $L(Z, g')$, where $g'$ is appropriately chosen through the choice of $\hat\pi(\rho)$,
plus the additive term $\delta_\lambda$. By definition this additive term does not depend
on the particular probability distribution generating the data and leads
to empirical compensation.

• In the examples we will be interested in throughout this work, $\hat\pi(\rho)$ will
be equal either to $\rho$ or to a Dirac distribution on some function, which is
not necessarily in $\mathcal{G}$.

• For any loss function $L$, any set $\mathcal{P}$ and any $\lambda > 0$, one may choose
$\delta_\lambda(Z, g, g') = \frac{\lambda}{2} [L(Z, g) - L(Z, g')]^2$ (see Section 6).

Our results concern the sequentially randomized algorithm described in 
Figure 1, which for sake of shortness we will call the SeqRand algorithm. 
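
The steps of Figure 1 can be sketched in Python for a finite reference set. This is a minimal illustration under explicit assumptions, not a definitive implementation: it takes a uniform prior, the choice $\hat\pi(\rho) = \rho$, and the variance function $\delta_\lambda(Z, g, g') = \frac{\lambda}{2}[L(Z, g) - L(Z, g')]^2$ mentioned in the last item above. The function name `seqrand`, the loss signature `loss(y, prediction)` and the toy data are hypothetical.

```python
import math
import random

def seqrand(functions, data, loss, lam, rng=None):
    """Sketch of SeqRand on a finite set with uniform prior.

    Assumes pi_hat(rho) = rho and
    delta(Z, g, g') = (lam / 2) * (L(Z, g) - L(Z, g')) ** 2.
    Returns the output function and the drawn indices (g_0, ..., g_n).
    """
    rng = rng or random.Random(0)
    m = len(functions)
    s = [0.0] * m                          # S_0 = 0
    drawn = [rng.randrange(m)]             # step 1: g_0 drawn from the uniform prior
    for x_i, y_i in data:                  # step 2
        losses = [loss(y_i, g(x_i)) for g in functions]
        prev = losses[drawn[-1]]           # L(Z_i, g_{i-1})
        # Update (3.1): S_i = S_{i-1} + L(Z_i, g) + delta(Z_i, g, g_{i-1}).
        s = [c + l + 0.5 * lam * (l - prev) ** 2 for c, l in zip(s, losses)]
        shift = min(s)                     # numerical stabilisation only
        w = [math.exp(-lam * (c - shift)) for c in s]
        # Draw g_i from the Gibbs distribution pi_{-lam S_i}.
        drawn.append(rng.choices(range(m), weights=w)[0])
    # Step 3: uniform draw among g_0, ..., g_n.
    return functions[rng.choice(drawn)], drawn

# Toy usage with the square loss.
functions = [lambda x: 0.0, lambda x: x, lambda x: 1.0]
data = [(0.1, 0.1), (0.5, 0.5), (0.9, 0.9), (0.3, 0.3)]
g_out, drawn = seqrand(functions, data, lambda y, p: (y - p) ** 2, lam=2.0)
```

The quadratic term penalizes functions whose loss on the current point differs from that of the previously drawn function, which is the conservative mechanism the text describes.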
Remark 3.1. When $\delta_\lambda(Z, g, g')$ does not depend on $g$, we recover a more
standard-style algorithm to the extent that we then have $\pi_{-\lambda S_i} = \pi_{-\lambda \Sigma_i}$. Precisely,
our algorithm becomes the randomized version of Algorithm A. When
$\delta_\lambda(Z, g, g')$ depends on $g$, the posterior distributions tend to concentrate on
functions having small risk and small variance term. In Section 6, we will
take $\delta_\lambda(Z, g, g') = \frac{\lambda}{2} [L(Z, g) - L(Z, g')]^2$. This choice implies a conservative
mechanism: roughly, with high probability, among functions having low cumulative
risk $\Sigma_i$, $\hat g_i$ will be chosen close to $\hat g_{i-1}$.

For any $i \in \{0, \dots, n\}$, the quantities $S_i$, $\hat\rho_i$ and $\hat g_i$ depend on the training data only through $Z_1^i$, where we recall that $Z_1^i$ denotes $(Z_1, \dots, Z_i)$. Besides they are also random to the extent that they depend on the draws of the functions $\hat g_0, \dots, \hat g_{i-1}$.

The SeqRand algorithm produces a prediction function, which has three causes of randomness: the training data, the way $\hat g_i$ is obtained (step 2) and the uniform draw (step 3). For fixed $Z_1^i$ (i.e., conditional to $Z_1^i$), let $\Omega_i$ denote the joint distribution of $\hat g_0^i = (\hat g_0, \dots, \hat g_i)$. The randomizing distribution $\hat\mu$ of the output prediction function by SeqRand is the distribution on $\bar{\mathcal{G}}$ corresponding to the last two causes of randomness. From the previous definitions, for any function $h : \bar{\mathcal{G}} \to \mathbb{R}$, we have $\mathbb{E}_{g \sim \hat\mu}\, h(g) = \mathbb{E}_{\hat g_0^n \sim \Omega_n} \frac{1}{n+1} \sum_{i=0}^{n} h(\hat g_i)$. Our main upper bound controls the expected risk $\mathbb{E}_{Z_1^n} \mathbb{E}_{g \sim \hat\mu} R(g)$ of the SeqRand procedure.

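Theorem 3.1. Let $\Delta_\lambda(g, g') \triangleq \mathbb{E}_{Z \sim P}\, \delta_\lambda(Z, g, g')$ for $g \in \mathcal{G}$ and $g' \in \bar{\mathcal{G}}$, where we recall that $\delta_\lambda$ is a function satisfying the variance inequality. The expected risk of the SeqRand algorithm satisfies

(3.2) $$\mathbb{E}_{Z_1^n} \mathbb{E}_{g' \sim \hat\mu} R(g') \le \min_{\rho \in \mathcal{M}} \Big\{ \mathbb{E}_{g \sim \rho} R(g) + \mathbb{E}_{g \sim \rho} \mathbb{E}_{Z_1^n} \mathbb{E}_{g' \sim \hat\mu} \Delta_\lambda(g, g') + \frac{K(\rho, \pi)}{\lambda (n+1)} \Big\}.$$

In particular, when $\mathcal{G}$ is finite and when the loss function $L$ and the set $\mathcal{P}$ are such that $\delta_\lambda \equiv 0$, by taking $\pi$ uniform on $\mathcal{G}$, we get

(3.3) $$\mathbb{E}_{Z_1^n} \mathbb{E}_{g \sim \hat\mu} R(g) \le \min_{\mathcal{G}} R + \frac{\log|\mathcal{G}|}{\lambda (n+1)}.$$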

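Proof. Let $\mathcal{E}$ denote the expected risk of the SeqRand algorithm:

$$\mathcal{E} \triangleq \mathbb{E}_{Z_1^n} \mathbb{E}_{g \sim \hat\mu} R(g) = \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{E}_{Z_1^i} \mathbb{E}_{\hat g_0^i \sim \Omega_i} R(\hat g_i).$$

We recall that $Z_{n+1}$ is a random variable independent of the training set $Z_1^n$ and with the same distribution $P$. Let $S_{n+1}$ be defined by (3.1) for $i = n+1$. To shorten formulae, let $\hat\pi_i \triangleq \pi_{-\lambda S_i}$ so that by definition we have $\hat\rho_i = \hat\pi(\hat\pi_i)$. The variance inequality implies that

$$\mathbb{E}_{g' \sim \hat\pi(\rho)} R(g') \le -\frac{1}{\lambda} \mathbb{E}_{Z} \mathbb{E}_{g' \sim \hat\pi(\rho)} \log \mathbb{E}_{g \sim \rho}\, e^{-\lambda [L(Z, g) + \delta_\lambda(Z, g, g')]}.$$

So for any $i \in \{0, \dots, n\}$, for fixed $\hat g_0^{i-1} = (\hat g_0, \dots, \hat g_{i-1})$ and fixed $Z_1^i$, we have

$$\mathbb{E}_{\hat g_i \sim \hat\rho_i} R(\hat g_i) \le -\frac{1}{\lambda} \mathbb{E}_{Z_{i+1}} \mathbb{E}_{\hat g_i \sim \hat\rho_i} \log \mathbb{E}_{g \sim \hat\pi_i}\, e^{-\lambda [L(Z_{i+1}, g) + \delta_\lambda(Z_{i+1}, g, \hat g_i)]}.$$

Taking the expectations w.r.t. $(Z_1^i, \hat g_0^{i-1})$, we get

$$\mathbb{E}_{Z_1^i} \mathbb{E}_{\hat g_0^i} R(\hat g_i) \le -\frac{1}{\lambda} \mathbb{E}_{Z_1^{i+1}} \mathbb{E}_{\hat g_0^i} \log \mathbb{E}_{g \sim \hat\pi_i}\, e^{-\lambda [L(Z_{i+1}, g) + \delta_\lambda(Z_{i+1}, g, \hat g_i)]}.$$

Consequently, by the chain rule (i.e., cancellation in the sum of logarithmic terms; [11]) and by intensive use of Fubini's theorem, we get

$$\begin{aligned}
\mathcal{E} &= \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{E}_{Z_1^i} \mathbb{E}_{\hat g_0^i} R(\hat g_i) \\
&\le -\frac{1}{\lambda (n+1)} \sum_{i=0}^{n} \mathbb{E}_{Z_1^{i+1}} \mathbb{E}_{\hat g_0^i} \log \mathbb{E}_{g \sim \hat\pi_i}\, e^{-\lambda [L(Z_{i+1}, g) + \delta_\lambda(Z_{i+1}, g, \hat g_i)]} \\
&= -\frac{1}{\lambda (n+1)} \mathbb{E}_{Z_1^{n+1}} \mathbb{E}_{\hat g_0^n} \sum_{i=0}^{n} \log \mathbb{E}_{g \sim \hat\pi_i}\, e^{-\lambda [L(Z_{i+1}, g) + \delta_\lambda(Z_{i+1}, g, \hat g_i)]} \\
&= -\frac{1}{\lambda (n+1)} \mathbb{E}_{Z_1^{n+1}} \mathbb{E}_{\hat g_0^n} \sum_{i=0}^{n} \log \frac{\mathbb{E}_{g \sim \pi}\, e^{-\lambda S_{i+1}(g)}}{\mathbb{E}_{g \sim \pi}\, e^{-\lambda S_i(g)}} \\
&= -\frac{1}{\lambda (n+1)} \mathbb{E}_{Z_1^{n+1}} \mathbb{E}_{\hat g_0^n} \log \frac{\mathbb{E}_{g \sim \pi}\, e^{-\lambda S_{n+1}(g)}}{\mathbb{E}_{g \sim \pi}\, e^{-\lambda S_0(g)}} \\
&= -\frac{1}{\lambda (n+1)} \mathbb{E}_{Z_1^{n+1}} \mathbb{E}_{\hat g_0^n} \log \mathbb{E}_{g \sim \pi}\, e^{-\lambda S_{n+1}(g)}.
\end{aligned}$$

Now from the following lemma, we obtain

$$\begin{aligned}
\mathcal{E} &\le -\frac{1}{\lambda (n+1)} \log \mathbb{E}_{g \sim \pi}\, e^{-\lambda \mathbb{E}_{Z_1^{n+1}} \mathbb{E}_{\hat g_0^n} S_{n+1}(g)} \\
&= -\frac{1}{\lambda (n+1)} \log \mathbb{E}_{g \sim \pi}\, e^{-\lambda [(n+1) R(g) + \mathbb{E}_{Z_1^n} \mathbb{E}_{\hat g_0^n} \sum_{i=0}^{n} \Delta_\lambda(g, \hat g_i)]} \\
&= \min_{\rho \in \mathcal{M}} \Big\{ \mathbb{E}_{g \sim \rho} R(g) + \mathbb{E}_{g \sim \rho} \mathbb{E}_{Z_1^n} \mathbb{E}_{\hat g_0^n} \frac{\sum_{i=0}^{n} \Delta_\lambda(g, \hat g_i)}{n+1} + \frac{K(\rho, \pi)}{\lambda (n+1)} \Big\}.
\end{aligned}$$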
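Lemma 3.2. Let $\mathcal{W}$ be a real-valued measurable function defined on a product space $\mathcal{A}_1 \times \mathcal{A}_2$ and let $\mu_1$ and $\mu_2$ be probability distributions on respectively $\mathcal{A}_1$ and $\mathcal{A}_2$ such that $\mathbb{E}_{a_1 \sim \mu_1} \log \mathbb{E}_{a_2 \sim \mu_2}\, e^{-\mathcal{W}(a_1, a_2)} < +\infty$. We have

$$-\mathbb{E}_{a_1 \sim \mu_1} \log \mathbb{E}_{a_2 \sim \mu_2}\, e^{-\mathcal{W}(a_1, a_2)} \le -\log \mathbb{E}_{a_2 \sim \mu_2}\, e^{-\mathbb{E}_{a_1 \sim \mu_1} \mathcal{W}(a_1, a_2)}.$$

Proof. By using twice (2.1) and Fubini's theorem, we have

$$\begin{aligned}
-\mathbb{E}_{a_1} \log \mathbb{E}_{a_2 \sim \mu_2}\, e^{-\mathcal{W}(a_1, a_2)} &= \mathbb{E}_{a_1} \inf_{\rho} \{ \mathbb{E}_{a_2 \sim \rho} \mathcal{W}(a_1, a_2) + K(\rho, \mu_2) \} \\
&\le \inf_{\rho} \mathbb{E}_{a_1} \{ \mathbb{E}_{a_2 \sim \rho} \mathcal{W}(a_1, a_2) + K(\rho, \mu_2) \} \\
&= -\log \mathbb{E}_{a_2 \sim \mu_2}\, e^{-\mathbb{E}_{a_1} \mathcal{W}(a_1, a_2)}. \qquad \square
\end{aligned}$$

Inequality (3.3) is a direct consequence of (3.2). □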

Theorem 3.1 bounds the expected risk of a randomized procedure, where the expectation is taken w.r.t. both the training set distribution and the randomizing distribution. From the following lemma, for convex loss functions, (3.3) implies

(3.4) $$\mathbb{E}_{Z_1^n} R(\mathbb{E}_{g \sim \hat\mu}\, g) \le \min_{\mathcal{G}} R + \frac{\log|\mathcal{G}|}{\lambda (n+1)},$$

where we recall that $\hat\mu$ is the randomizing distribution of the SeqRand algorithm and $\lambda$ is a parameter whose typical value is the largest $\lambda > 0$ such that $\delta_\lambda \equiv 0$.



Lemma 3.3. For convex loss functions, the doubly expected risk of a randomized algorithm is at least the expected risk of the deterministic version of the randomized algorithm; that is, if $\hat\mu$ denotes the randomizing distribution, we have

$$\mathbb{E}_{Z_1^n} R(\mathbb{E}_{g \sim \hat\mu}\, g) \le \mathbb{E}_{Z_1^n} \mathbb{E}_{g \sim \hat\mu} R(g).$$

Proof. The result is a direct consequence of Jensen's inequality. □

In [24], the authors rely on worst-case analysis to recover standard-style 
statistical results such as Vapnik's bounds [49]. Theorem 3.1 can be seen as 
a complement to this pioneering work. Inequality (3.4) is the model selection 
bound that is well known for least square regression and entropy loss, and 
that has been recently proved for general losses in [34]. 

Let us discuss the generalized form of the result. The right-hand side
(r.h.s.) of (3.2) is a classical regularized risk, which appears naturally in
the PAC-Bayesian approach (see, e.g., [7, 22, 56]). An advantage of stating
the result this way is to be able to deal with uncountable infinite $\mathcal{G}$.
Even when $\mathcal{G}$ is countable, this formulation has some benefit to the extent
that for any measurable function $h : \mathcal{G} \to \mathbb{R}$, $\min_{\rho \in \mathcal{M}} \{ \mathbb{E}_{g \sim \rho} h(g) + K(\rho, \pi) \} \le
\min_{g \in \mathcal{G}} \{ h(g) + \log \pi^{-1}(g) \}$.

Our generalization error bounds depend on two quantities $\lambda$ and $\pi$ which
are the parameters of our algorithm. Their choice depends on the precise
setting. Nevertheless, when $\mathcal{G}$ is finite and with no particular structure a
priori, a natural choice for $\pi$ is the uniform distribution on $\mathcal{G}$.

Once the distribution $\pi$ is fixed, an appropriate choice for the parameter
$\lambda$ is the minimizer of the r.h.s. of (3.2). This minimizer is unknown to the
statistician, and it is an open problem to adaptively choose $\lambda$ close to it.

4. Link with sequential prediction. This section aims at providing ex- 
amples for which the variance inequality holds, at stating results coming 
from the online learning community in our batch setting (Section 4.1) and 
at providing new results for the sequential prediction setting in which no 
probabilistic assumption is made on the way the data are generated (Sec- 
tion 4.2). 

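4.1. From online to batch. In [33, 51, 52], the loss function is assumed to satisfy: there are positive numbers $\eta$ and $c$ such that

(4.1) $$\forall \rho \in \mathcal{M} \quad \exists g_\rho \in \bar{\mathcal{G}} \quad \forall x \in \mathcal{X} \quad \forall y \in \mathcal{Y} \qquad L[(x, y), g_\rho] \le -\frac{c}{\eta} \log \mathbb{E}_{g \sim \rho}\, e^{-\eta L[(x, y), g]}.$$

Remark 4.1. If $g \mapsto e^{-\eta L(z, g)}$ is concave, then (4.1) holds for $c = 1$ (and one may take $g_\rho = \mathbb{E}_{g \sim \rho}\, g$).

Assumption (4.1) implies that the variance inequality is satisfied both for $\lambda = \eta$ and $\delta_\lambda(Z, g, g') = (1 - 1/c) L(Z, g')$ and for $\lambda = \eta/c$ and $\delta_\lambda(Z, g, g') = (c - 1) L(Z, g)$, and we may take in both cases $\hat\pi(\rho)$ as the Dirac distribution at $g_\rho$. This leads to the same procedure that is described in the following straightforward corollary of Theorem 3.1.

Corollary 4.1. Let $g_{\pi_{-\eta \Sigma_i}}$ be defined in the sense of (4.1) (for $\rho = \pi_{-\eta \Sigma_i}$). Consider the algorithm which predicts by drawing a function in $\{ g_{\pi_{-\eta \Sigma_0}}, \dots, g_{\pi_{-\eta \Sigma_n}} \}$ according to the uniform distribution. Under assumption (4.1), its expected risk $\mathbb{E}_{Z_1^n} \frac{1}{n+1} \sum_{i=0}^{n} R(g_{\pi_{-\eta \Sigma_i}})$ is upper bounded by

(4.2) $$\min_{\rho \in \mathcal{M}} \Big\{ c\, \mathbb{E}_{g \sim \rho} R(g) + \frac{c}{\eta} \frac{K(\rho, \pi)}{n+1} \Big\}.$$
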

12 



J.-Y. AUDIBERT 



This result is not surprising in view of the following two results. The first one comes from worst-case analysis in sequential prediction.

Theorem 4.2 ([33], Theorem 3.8). Let $\mathcal{G}$ be countable. For any $g \in \mathcal{G}$, let $\Sigma_i(g) = \sum_{j=1}^{i} L(Z_j, g)$ (still) denote the cumulative loss up to time $i$ of the expert which always predicts according to function $g$. Under assumption (4.1), the cumulative loss on $Z_1^n$ of the strategy in which the prediction at time $i$ is done according to $g_{\pi_{-\eta \Sigma_{i-1}}}$ in the sense of (4.1) (for $\rho = \pi_{-\eta \Sigma_{i-1}}$) is bounded by

(4.3) $$\inf_{g \in \mathcal{G}} \Big\{ c\, \Sigma_n(g) + \frac{c}{\eta} \log \pi^{-1}(g) \Big\}.$$

The second result shows how the previous bound can be transposed into 
our model selection context by the following lemma. 

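Lemma 4.3. Let $\mathcal{A}$ be a learning algorithm which produces the prediction function $\mathcal{A}(Z_1^i)$ at time $i + 1$, that is, from the data $Z_1^i = (Z_1, \dots, Z_i)$. Let $\mathcal{L}$ be the randomized algorithm which produces a prediction function $\mathcal{L}(Z_1^n)$ drawn according to the uniform distribution on $\{ \mathcal{A}(\emptyset), \mathcal{A}(Z_1), \dots, \mathcal{A}(Z_1^n) \}$. The (doubly) expected risk of $\mathcal{L}$ is equal to $\frac{1}{n+1}$ times the expectation of the cumulative loss of $\mathcal{A}$ on the sequence $Z_1, \dots, Z_{n+1}$.

Proof. By Fubini's theorem, we have

$$\begin{aligned}
\mathbb{E} R[\mathcal{L}(Z_1^n)] &= \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{E}_{Z_1^n} R[\mathcal{A}(Z_1^i)] \\
&= \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{E}_{Z_1^{i+1}} L[Z_{i+1}, \mathcal{A}(Z_1^i)] \\
&= \frac{1}{n+1} \mathbb{E}_{Z_1^{n+1}} \sum_{i=0}^{n} L[Z_{i+1}, \mathcal{A}(Z_1^i)]. \qquad \square
\end{aligned}$$
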

For any $\eta > 0$, let $c(\eta)$ denote the infimum of the $c$ for which (4.1) holds.
Under weak assumptions, Vovk [52] proved that the infimum exists and
studied the behavior of $c(\eta)$ and $a(\eta) = c(\eta)/\eta$, which are key quantities of
(4.2) and (4.3). Under weak assumptions, and in particular in the examples
given in Table 1, the optimal constants in (4.3) are $c(\eta)$ and $a(\eta)$ ([52],
Theorem 1) and we have $c(\eta) \ge 1$, $\eta \mapsto c(\eta)$ nondecreasing and $\eta \mapsto a(\eta)$
nonincreasing. From these last properties, we understand the trade-off which
occurs to choose the optimal $\eta$.

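Table 1
Value of $c(\eta)$ for different loss functions

| Learning task | Output space | Loss $L(Z, g)$ | $c(\eta)$ |
|---|---|---|---|
| Entropy loss ([33], Example 4.3) | $\mathcal{Y} = [0; 1]$ | $Y \log\frac{Y}{g(X)} + (1 - Y) \log\frac{1 - Y}{1 - g(X)}$ | $c(\eta) = 1$ if $\eta \le 1$; $c(\eta) = \infty$ if $\eta > 1$ |
| Absolute loss game ([33], Section 4.2) | $\mathcal{Y} = [0; 1]$ | $\lvert Y - g(X) \rvert$ | $\frac{\eta}{2 \log[2/(1 + e^{-\eta})]} = 1 + \eta/4 + o(\eta)$ |
| Square loss ([33], Example 4.4) | $\mathcal{Y} = [-B, B]$ | $[Y - g(X)]^2$ | $c(\eta) = 1$ if $\eta \le 1/(2B^2)$; $c(\eta) = +\infty$ if $\eta > 1/(2B^2)$ |
| $L_q$-loss, $q > 1$ (see Theorem 4.4) | $\mathcal{Y} = [-B, B]$ | $\lvert Y - g(X) \rvert^q$ | $c(\eta) = 1$ if $\eta \le \frac{q-1}{q B^q} (1 \wedge 2^{2-q})$ |

Here $B$ denotes a positive real.

Table 1 specifies (4.2) in different well-known learning tasks. For instance, for bounded least square regression (i.e., when $|Y| \le B$ for some $B > 0$), the generalization error of the algorithm described in Corollary 4.1 when $\eta = 1/(2B^2)$ is upper bounded by

(4.4) $$\min_{\rho \in \mathcal{M}} \Big\{ \mathbb{E}_{g \sim \rho} R(g) + 2B^2 \frac{K(\rho, \pi)}{n+1} \Big\}.$$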

The constant appearing in front of the Kullback-Leibler divergence is much
smaller than the ones obtained in unbounded regression setting even with
Gaussian noise and bounded regression function (see [19, 34] and [22], page
87). The differences between these results partly come from the absence
of boundedness assumptions on the output and from the weighted average
used in the aforementioned works. Indeed the weighted average prediction
function, that is, $\mathbb{E}_{g \sim \rho}\, g$, does not satisfy (4.1) for $c = 1$ and $\eta = 1/(2B^2)$ as
was pointed out in [33], Example 3.13. Nevertheless, it satisfies (4.1) for $c = 1$
and $\eta \le 1/(8B^2)$ (by using the concavity of $x \mapsto e^{-x^2}$ on $[-1/\sqrt{2}; 1/\sqrt{2}]$ and
Remark 4.1), which leads to a similar but weaker bound [see (4.2)].
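
The behavior of the weighted average under the square loss can be probed numerically. The following Python sketch is only an illustration under stated assumptions: `mixability_gap` is a hypothetical helper returning the right-hand side minus the left-hand side of (4.1) with $c = 1$ when $g_\rho$ is the weighted average, the random trials use $B = 1$, and the two-point configuration with weights $(0.7, 0.3)$ is an example chosen here, not the one of [33], Example 3.13.

```python
import math
import random

def mixability_gap(predictions, weights, y, eta):
    """Right-hand side minus left-hand side of (4.1) with c = 1 for the
    square loss, when g_rho is the weighted average of the predictions."""
    mean = sum(w * p for w, p in zip(weights, predictions))
    mix = sum(w * math.exp(-eta * (y - p) ** 2) for w, p in zip(weights, predictions))
    return -math.log(mix) / eta - (y - mean) ** 2

B = 1.0
rng = random.Random(0)

# Random trials at eta = 1 / (8 B^2): the gap should never be negative.
gaps = []
for _ in range(1000):
    preds = [rng.uniform(-B, B) for _ in range(5)]
    raw = [rng.random() + 1e-9 for _ in range(5)]
    weights = [r / sum(raw) for r in raw]
    gaps.append(mixability_gap(preds, weights, rng.uniform(-B, B), 1.0 / (8.0 * B ** 2)))
min_gap = min(gaps)

# A two-point configuration at eta = 1 / (2 B^2) where the weighted average fails.
counter_gap = mixability_gap([-B, B], [0.7, 0.3], B, 1.0 / (2.0 * B ** 2))
```

In the second configuration the weighted average is $-0.4B$, its loss at $y = B$ is $1.96 B^2$, and the right-hand side of (4.1) is about $1.86 B^2$, so the gap is negative, consistent with the statement that the weighted average does not satisfy (4.1) for $c = 1$ and $\eta = 1/(2B^2)$.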

Case of the Lq-losses. To deal with these losses, we need the following 
slight generalization of the result given in Appendix A of [35]. 

Theorem 4.4. Let $\mathcal{Y} = [a; b]$. We consider a nonregularized loss function,
that is, a loss function such that $L(Z, g) = \ell[Y, g(X)]$ for any $Z =
(X, Y) \in \mathcal{Z}$ and some function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$. For any $y \in \mathcal{Y}$, let $\ell_y$ be the
function $[y' \mapsto \ell(y, y')]$. If for any $y \in \mathcal{Y}$:

• $\ell_y$ is continuous on $\mathcal{Y}$,

• $\ell_y$ decreases on $[a; y]$, increases on $[y; b]$ and $\ell_y(y) = 0$,

• $\ell_y$ is twice differentiable on the open set $(a; y) \cup (y; b)$,






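then (4.1) is satisfied for $c = 1$ and

(4.5) $$\eta \le \inf_{a \le y_1 < y < y_2 \le b} \frac{\ell'_{y_1}(y)\, \ell''_{y_2}(y) - \ell''_{y_1}(y)\, \ell'_{y_2}(y)}{-\ell'_{y_1}(y)\, \ell'_{y_2}(y)\, [\ell'_{y_2}(y) - \ell'_{y_1}(y)]},$$

where the infimum is taken w.r.t. $y_1$, $y$ and $y_2$.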



Proof. See Section 10.1. □ 



Remark 4.2. This result simplifies the original one to the extent that
$\ell_y$ does not need to be twice differentiable at point $y$ and the range of values
for $y$ in the infimum is $(y_1; y_2)$ instead of $(a; b)$.

Corollary 4.5. For the $L_q$-loss, when $\mathcal{Y} = [-B; B]$ for some $B > 0$, condition (4.1) is satisfied for $c = 1$ and

$$\eta \le \frac{q-1}{q B^q} (1 \wedge 2^{2-q}).$$

Proof. We apply Theorem 4.4. By simple computations, the r.h.s. of (4.5) is

$$\begin{aligned}
&\inf_{-B \le y_1 < y < y_2 \le B} \frac{(q-1)(y_2 - y_1)}{q (y - y_1)(y_2 - y) [(y - y_1)^{q-1} + (y_2 - y)^{q-1}]} \\
&\qquad = \frac{q-1}{q (2B)^q} \inf_{0 < t < 1} \frac{1}{t (1-t) [t^{q-1} + (1-t)^{q-1}]}.
\end{aligned}$$

For $1 < q \le 2$, the infimum is reached for $t = 1/2$ and (4.5) can be written as $\eta \le \frac{q-1}{q B^q}$. For $q \ge 2$, since the previous infimum is larger than $\inf_{0 < t < 1} \frac{1}{t(1-t)} = 4$, (4.5) is satisfied at least when $\eta \le \frac{q-1}{q B^q} 2^{2-q}$. □
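
The one-dimensional infimum appearing in this proof is easy to check on a grid. The Python sketch below is a rough numerical illustration: `lq_infimum` and `eta_bound` are hypothetical helper names, and the grid search only approximates the infimum from above.

```python
def lq_infimum(q, grid_size=999):
    """Grid approximation of the infimum over 0 < t < 1 of
    1 / (t (1 - t) [t^(q-1) + (1 - t)^(q-1)])."""
    best = float("inf")
    for k in range(1, grid_size + 1):
        t = k / (grid_size + 1)            # with the default grid, t = 1/2 is included
        value = 1.0 / (t * (1.0 - t) * (t ** (q - 1) + (1.0 - t) ** (q - 1)))
        best = min(best, value)
    return best

def eta_bound(q, B):
    """Sufficient range of eta stated in Corollary 4.5."""
    return (q - 1) / (q * B ** q) * min(1.0, 2.0 ** (2 - q))

inf_q15 = lq_infimum(1.5)   # expected to equal 2 ** 1.5, attained at t = 1/2
inf_q3 = lq_infimum(3.0)    # expected to be at least 4
eta_square = eta_bound(2, 1.0)
```

For $q = 2$ and $B = 1$ the bound equals $1/2$, which agrees with the square loss threshold $1/(2B^2)$ of Table 1.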

4.2. Sequential prediction. First note that using Corollary 4.5 and Theorem 4.2,
we obtain a new result concerning sequential prediction for $L_q$-loss.
Nevertheless this result is not due to our approach but to a refinement of 
the argument in [35], Appendix A. In this section, we will rather concen- 
trate on giving results for sequential prediction coming from the arguments 
underlying Theorem 3.1. 

In the online setting, the data points come one by one and there is no probabilistic assumption on the way they are generated. In this case, one should modify the definition of the variance function into: for any $\lambda > 0$, let $\delta_\lambda$ be a real-valued function defined on $\mathcal{Z} \times \mathcal{G} \times \bar{\mathcal{G}}$ that satisfies the following online variance inequality:

$$\forall \rho \in \mathcal{M} \quad \exists \hat\pi(\rho) \in \mathcal{D} \qquad \sup_{Z \in \mathcal{Z}} \Big\{ \mathbb{E}_{g' \sim \hat\pi(\rho)} \log \mathbb{E}_{g \sim \rho}\, e^{\lambda [L(Z, g') - L(Z, g) - \delta_\lambda(Z, g, g')]} \Big\} \le 0.$$

The only difference with the variance inequality defined in Section 3 is the removal of the expectation with respect to $Z$. Naturally if $\delta_\lambda$ satisfies the online variance inequality, then it satisfies the variance inequality. The online version of the SeqRand algorithm is described in Figure 2. It satisfies the following theorem whose proof follows the same line as the one of Theorem 3.1.

Input: $\lambda > 0$ and $\pi$ a distribution on the set $\mathcal{G}$.

1. Define $\hat\rho_0 \triangleq \hat\pi(\pi)$ in the sense of the online variance inequality and draw a function $\hat g_0$ according to this distribution. For data $Z_1$, predict according to $\hat g_0$. Let $S_0(g) = 0$ for any $g \in \mathcal{G}$.

2. For any $i \in \{1, \dots, n-1\}$, define

$$S_i(g) \triangleq S_{i-1}(g) + L(Z_i, g) + \delta_\lambda(Z_i, g, \hat g_{i-1}) \quad \text{for any } g \in \mathcal{G},$$

and

$$\hat\rho_i \triangleq \hat\pi(\pi_{-\lambda S_i}) \quad \text{in the sense of the online variance inequality}$$

and draw a function $\hat g_i$ according to the distribution $\hat\rho_i$. For data $Z_{i+1}$, predict according to $\hat g_i$.

Fig. 2. The online SeqRand algorithm.


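Theorem 4.6. The cumulative loss of the online SeqRand algorithm satisfies

$$\sum_{i=1}^{n} \mathbb{E}_{\hat g_{i-1}} L(Z_i, \hat g_{i-1}) \le \min_{\rho \in \mathcal{M}} \Big\{ \mathbb{E}_{g \sim \rho} \sum_{i=1}^{n} L(Z_i, g) + \mathbb{E}_{g \sim \rho} \mathbb{E}_{\hat g_0^{n-1}} \sum_{i=1}^{n} \delta_\lambda(Z_i, g, \hat g_{i-1}) + \frac{K(\rho, \pi)}{\lambda} \Big\}.$$

In particular, when $\mathcal{G}$ is finite, by taking $\pi$ uniform on $\mathcal{G}$, we get

$$\sum_{i=1}^{n} \mathbb{E}_{\hat g_{i-1}} L(Z_i, \hat g_{i-1}) \le \min_{g \in \mathcal{G}} \Big\{ \sum_{i=1}^{n} L(Z_i, g) + \mathbb{E}_{\hat g_0^{n-1}} \sum_{i=1}^{n} \delta_\lambda(Z_i, g, \hat g_{i-1}) + \frac{\log|\mathcal{G}|}{\lambda} \Big\}.$$
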



Up to the online variance function $\delta_\lambda$, the online variance inequality is the
generic algorithm condition of [33], page 11. So cases where $\delta_\lambda$ is equal to
zero are already known. Now new results can be obtained by using that for
any loss function $L$ and any $\lambda > 0$, the online variance inequality is satisfied
for $\delta_\lambda(Z, g, g') = \frac{\lambda}{2} [L(Z, g) - L(Z, g')]^2$ (proof in Section 10.2). The associated
distribution $\hat\pi(\rho)$ is then just $\rho$. In spirit, the result associated with these
choices is similar to the ones obtained in [27], Section 4, to the extent that it
gives a bound with second-order terms. Nevertheless, we do not know how
to properly choose the parameter $\lambda$ whereas the aforementioned work solves
this problem. More discussion on this topic can be found in [8], Section 4.2.

5. Model selection aggregation under Juditsky, Rigollet and Tsybakov
assumptions [34]. The main result of [34] relies on the following assumption
on the loss function $L$ and the set $\mathcal{P}$ of probability distributions on $\mathcal{Z}$ in
which we assume that the true distribution lies. There exist $\lambda > 0$ and a
real-valued function $\psi$ defined on $\mathcal{G} \times \mathcal{G}$ such that for any $P \in \mathcal{P}$

(5.1)
$$\begin{cases} \mathbb{E}_{Z \sim P}\, e^{\lambda[L(Z, g') - L(Z, g)]} \le \psi(g', g), & \text{for any } g, g' \in \mathcal{G}, \\ \psi(g, g) = 1, & \text{for any } g \in \mathcal{G}, \\ \text{the function } [g \mapsto \psi(g', g)] \text{ is concave} & \text{for any } g' \in \mathcal{G}. \end{cases}$$

Theorem 3.1 gives the following result. 

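Corollary 5.1. Consider the algorithm which draws uniformly its prediction
function in the set $\{\mathbb{E}_{g \sim \pi_{-\lambda\Sigma_0}} g, \dots, \mathbb{E}_{g \sim \pi_{-\lambda\Sigma_n}} g\}$. Under assumption
(5.1), its expected risk $\mathbb{E}_{Z_1^n} \frac{1}{n+1} \sum_{i=0}^{n} R(\mathbb{E}_{g \sim \pi_{-\lambda\Sigma_i}} g)$ is upper bounded by

(5.2) $$\min_{\rho \in \mathcal{M}} \Bigl\{ \mathbb{E}_{g \sim \rho} R(g) + \frac{K(\rho, \pi)}{\lambda(n+1)} \Bigr\}.$$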

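Proof. We start by proving that the variance inequality holds with
$\delta_\lambda = 0$, and that we may take $\hat\pi(\rho)$ as the Dirac distribution at the function
$\mathbb{E}_{g \sim \rho}\, g$. By using Jensen's inequality and Fubini's theorem, assumption (5.1)
implies that

$$\begin{aligned}
\mathbb{E}_{g' \sim \hat\pi(\rho)} \mathbb{E}_{Z \sim P} \log \mathbb{E}_{g \sim \rho}\, e^{\lambda[L(Z, g') - L(Z, g)]}
&= \mathbb{E}_{Z \sim P} \log \mathbb{E}_{g \sim \rho}\, e^{\lambda[L(Z, \mathbb{E}_{g' \sim \rho} g') - L(Z, g)]} \\
&\le \log \mathbb{E}_{g \sim \rho} \mathbb{E}_{Z \sim P}\, e^{\lambda[L(Z, \mathbb{E}_{g' \sim \rho} g') - L(Z, g)]} \\
&\le \log \mathbb{E}_{g \sim \rho}\, \psi(\mathbb{E}_{g' \sim \rho} g', g) \\
&\le \log \psi(\mathbb{E}_{g' \sim \rho} g', \mathbb{E}_{g \sim \rho} g) \\
&= 0,
\end{aligned}$$

so that we can apply Theorem 3.1. It remains to note that in this context
the SeqRand algorithm is the one described in the corollary. □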
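
The algorithm of Corollary 5.1 can be sketched for a finite reference set as follows. This is a hedged illustration, not the author's code: it assumes that $\Sigma_i$ is the cumulative loss on the first $i$ data points, that the prediction function associated with a weight vector is the corresponding convex combination of the functions of the set, and that the averaged variant corresponds to Algorithm A. All function names are hypothetical.

```python
import numpy as np

def gibbs_weight_path(losses, lam, prior=None):
    """Return W with W[i] the Gibbs weights pi_{-lam * Sigma_i}, for i = 0..n.

    losses[j, k] is the loss of function g_k on the (j+1)-th data point;
    Sigma_i is assumed to be the cumulative loss on the first i points.
    """
    n, d = losses.shape
    prior = np.full(d, 1.0 / d) if prior is None else np.asarray(prior, float)
    cum = np.vstack([np.zeros(d), np.cumsum(losses, axis=0)])  # Sigma_0, ..., Sigma_n
    logw = np.log(prior) - lam * cum
    logw -= logw.max(axis=1, keepdims=True)                    # numerical stabilisation
    W = np.exp(logw)
    return W / W.sum(axis=1, keepdims=True)

def draw_mixture(W, rng):
    """Randomized version: pick one of the n+1 weight vectors uniformly at random."""
    return W[rng.integers(W.shape[0])]

def progressive_mixture(W):
    """Averaged version (assumed here to match Algorithm A): mean of the n+1 weight vectors."""
    return W.mean(axis=0)
```

Given a weight vector `w`, the predictor would be `sum(w[k] * g_k(x))`; for convex losses the averaged variant has risk no larger than the expected risk of the randomized one, by Jensen's inequality.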
In this context, the SeqRand algorithm reduces to the randomized version
of Algorithm A. From Lemma 3.3, for convex loss functions, (5.2) also holds 
for the risk of Algorithm A. Corollary 5.1 also shows that the risk bounds for 
Algorithm A proved in [34], Theorem 3.2, and the examples of [34], Section 
4.2, hold with the same constants for the SeqRand algorithm (provided 
that the expected risk w.r.t. the training set distribution is replaced by the 
expected risk w.r.t. both training set and randomizing distributions). 

Regarding assumption (5.1), note that it does not a priori require the
function $L$ to be convex. Nevertheless, all known relevant examples deal
with "strongly" convex loss functions, and we know that in general the
assumption will not hold for the Support Vector Machine (or hinge) loss
function and for the absolute loss function. Indeed, without further assumption,
one cannot expect rates better than $1/\sqrt{n}$ for these loss functions (see
Section 8.3).

By taking the appropriate variance function $\delta_\lambda(Z, g, g')$, it is possible to
prove that the results in [34], Theorem 3.1, and [34], Section 4.1, hold for
the SeqRand algorithm (provided that the expected risk w.r.t. the training
set distribution is replaced by the expected risk w.r.t. both training set
and randomizing distributions). The choice of $\delta_\lambda(Z, g, g')$, which for the sake
of brevity we do not specify, is in fact such that the resulting SeqRand
algorithm is again the randomized version of Algorithm A.

6. Standard-style statistical bounds. This section proposes new results
of a different kind. In the previous sections, under convexity assumptions, we
were able to achieve fast rates. Here we make no assumption on either the loss
function or the probability generating the data. Nevertheless, we show
that the SeqRand algorithm applied with $\delta_\lambda(Z, g, g') = \lambda[L(Z, g) - L(Z, g')]^2/2$
satisfies a sharp standard-style statistical bound.

This section contains two parts: the first one provides results in expecta- 
tion (as in the preceding sections) whereas the second part provides deviation 
inequalities on the risk that require advances on the sequential prediction 
analysis. 

6.1. Bounds on the expected risk. 
6.1.1. Bernstein's type bound. 

Theorem 6.1. Let $V(g, g') = \mathbb{E}_Z\{[L(Z, g) - L(Z, g')]^2\}$. Consider the
SeqRand algorithm applied with $\delta_\lambda(Z, g, g') = \lambda[L(Z, g) - L(Z, g')]^2/2$ and
$\hat\pi(\rho) = \rho$. Its expected risk $\mathbb{E}_{Z_1^n} \mathbb{E}_{g' \sim \hat\mu} R(g')$, where we recall that $\hat\mu$ denotes the
randomizing distribution, satisfies
(6.1)



Proof. See Section 10.2. □ 



To make (6.1) more explicit and to obtain a generalization error bound 
in which the randomizing distribution does not appear in the r.h.s. of the 
bound, the following corollary considers a widely used assumption relating 
the variance term to the excess risk (see Mammen and Tsybakov [41, 47], 
and also Polonik [44]). Precisely, from Theorem 6.1, we obtain: 

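Corollary 6.2. If there exist $0 \le \gamma \le 1$ and a prediction function $\tilde g$
(not necessarily in $\mathcal{G}$) such that $V(g, \tilde g) \le c[R(g) - R(\tilde g)]^\gamma$ for any $g \in \mathcal{G}$, the
expected risk $\mathcal{E} = \mathbb{E}_{Z_1^n} \mathbb{E}_{g' \sim \hat\mu} R(g')$ of the SeqRand algorithm used in Theorem
6.1 satisfies:

• When $\gamma = 1$,

$$\mathcal{E} - R(\tilde g) \le \min_{\rho \in \mathcal{M}} \Bigl\{ \frac{1 + c\lambda}{1 - c\lambda} [\mathbb{E}_{g \sim \rho} R(g) - R(\tilde g)] + \frac{K(\rho, \pi)}{(1 - c\lambda)\lambda(n+1)} \Bigr\}.$$

In particular, for $\mathcal{G}$ finite, $\pi$ the uniform distribution, $\lambda = 1/(2c)$, when $\tilde g$
belongs to $\mathcal{G}$, we get $\mathcal{E} \le \min_{g \in \mathcal{G}} R(g) + \frac{4c \log|\mathcal{G}|}{n+1}$.

• When $\gamma < 1$, for any $0 < \beta < 1$ and for $\tilde R(g) = R(g) - R(\tilde g)$,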



V 



1/(1-7) 

1-/3. 



Proof. See Section 10.3. □ 
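
As a quick check of the special case stated for $\gamma = 1$, the arithmetic can be spelled out. The derivation below is a sketch under two assumptions: that the $\gamma = 1$ bound reads as written in its first line, and that $K(\rho, \pi)$ is the Kullback-Leibler divergence, so that a Dirac mass on a finite set with uniform prior has $K = \log|\mathcal{G}|$.

```latex
% Assumed starting point (gamma = 1 case of Corollary 6.2):
%   E - R(\tilde g) <= min_rho { (1+c\lambda)/(1-c\lambda) [E_{g~rho} R(g) - R(\tilde g)]
%                                + K(rho,pi) / ((1-c\lambda) \lambda (n+1)) }.
\begin{align*}
&\text{Take } \rho = \delta_{\tilde g} \text{ with } \tilde g \in \mathcal{G}
  \text{ and } \pi \text{ uniform on } \mathcal{G}: \quad
  \mathbb{E}_{g \sim \rho} R(g) - R(\tilde g) = 0, \qquad K(\rho, \pi) = \log|\mathcal{G}|. \\
&\text{With } \lambda = \tfrac{1}{2c}: \quad
  (1 - c\lambda)\,\lambda = \tfrac{1}{2} \cdot \tfrac{1}{2c} = \tfrac{1}{4c}, \\
&\text{hence} \quad
  \mathcal{E} - R(\tilde g) \le \frac{\log|\mathcal{G}|}{\tfrac{1}{4c}(n+1)}
  = \frac{4c \log|\mathcal{G}|}{n+1}. \\
&\text{Since } 0 \le V(g, \tilde g) \le c\,[R(g) - R(\tilde g)] \text{ for all } g \in \mathcal{G},
  \text{ we have } R(\tilde g) = \min_{g \in \mathcal{G}} R(g).
\end{align*}
```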



To understand the sharpness of Theorem 6.1, we have to compare this re- 
sult with the following one that comes from the traditional (PAC-Bayesian) 
statistical learning approach which relies on supremum of empirical pro- 
cesses. In the following theorem, we consider the estimator minimizing the 
uniform bound, that is, the estimator for which we have the smallest upper 
bound on its generalization error. 

Theorem 6.3. We still use $V(g, g') = \mathbb{E}_Z\{[L(Z, g) - L(Z, g')]^2\}$. The
generalization error of the algorithm which draws its prediction function
according to the Gibbs distribution $\pi_{-\lambda\Sigma_n}$ satisfies
EznE^'^^^^^^R{g') 

(6.2) < mmi^Eg^,R{g) + ^^^'^^ + ^ + XEg^pEz^Eg,^^_^^^ V{g, g') 

1 " 1 

+ A- ^ Eg^pEz^E,,^^_^^^ [L{Z,,g) - L{Z,,g')f . 

i=i ) 

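Let $\varphi$ be the positive convex increasing function defined as $\varphi(t) = \frac{e^t - 1 - t}{t^2}$
and $\varphi(0) = \frac{1}{2}$ by continuity. When $\sup_{z \in \mathcal{Z}, g \in \mathcal{G}, g' \in \mathcal{G}} |L(z, g') - L(z, g)| \le B$,
we also have

(6.3) $$\mathbb{E}_{Z_1^n} \mathbb{E}_{g' \sim \pi_{-\lambda\Sigma_n}} R(g') \le \min_{\rho \in \mathcal{M}} \Bigl\{ \mathbb{E}_{g \sim \rho} R(g) + \lambda \varphi(\lambda B)\, \mathbb{E}_{g \sim \rho} \mathbb{E}_{Z_1^n} \mathbb{E}_{g' \sim \pi_{-\lambda\Sigma_n}} V(g, g') + \frac{K(\rho, \pi)}{\lambda n} \Bigr\}.$$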

Proof. See Section 10.4. □ 

As in Theorem 6.1, there is a variance term in which the randomizing
distribution is involved. As in Corollary 6.2, one can convert (6.3) into a proper
generalization error bound, that is, a nontrivial bound $\mathbb{E}_{Z_1^n} \mathbb{E}_{g \sim \pi_{-\lambda\Sigma_n}} R(g) \le
\mathcal{B}(n, \pi, \lambda)$ where the training data do not appear in $\mathcal{B}(n, \pi, \lambda)$.

By comparing (6.3) and (6.1), we see that the classical approach requires
the quantity $\sup_{g \in \mathcal{G}, g' \in \mathcal{G}} |L(Z, g') - L(Z, g)|$ to be uniformly bounded and the
unpleasing function $\varphi$ appears. In fact, using technical small expectations
theorems (see, e.g., [4], Lemma 7.1), exponential moment conditions on the
above quantity would be sufficient.

The symmetrization trick used to prove Theorem 6.1 is performed in the
prediction function space. We do not call on the second virtual training
set currently used in statistical learning theory (see [49]). Nevertheless, both
symmetrization tricks lead to the same nice property: we need no boundedness
assumption on the loss functions. In our setting, symmetrization on
training data leads to an unwanted expectation and to a constant four times
larger (see the two variance terms of (6.2) and the discussion in [5], Section
8.3.3).

In particular, deducing from Theorem 6.3 a corollary similar to Corollary
6.2 is only possible through (6.3) and provided that we have a boundedness
assumption on $\sup_{z \in \mathcal{Z}, g \in \mathcal{G}, g' \in \mathcal{G}} |L(z, g') - L(z, g)|$. Indeed one cannot
use (6.2) because of the last variance term in (6.2) (since $\Sigma_n$ depends on
the training data).
Our approach nevertheless has the following limitation: the proof of Corollary
6.2 does not use a chaining argument. As a consequence, in the particular
case when the model has polynomial entropies (see, e.g., [41]) and when the
assumption in Corollary 6.2 holds for $\gamma < 1$ (and not for $\gamma = 1$), Corollary
6.2 does not give the minimax optimal convergence rate. Combining the
better variance control presented here with the chaining argument is an
open problem.



6.1.2. Hoeffding's type bound. Contrary to generalization error bounds
coming from Bernstein's inequality, (6.1) does not require any boundedness
assumption. For bounded losses, without any variance assumption (i.e.,
roughly when the assumption used in Corollary 6.2 does not hold for $\gamma > 0$),
tighter results are obtained by using Hoeffding's inequality, that is: for any
random variable $W$ satisfying $a \le W \le b$ and for any $\lambda > 0$,

$$\mathbb{E}\, e^{\lambda(W - \mathbb{E}W)} \le e^{\lambda^2 (b-a)^2/8}.$$



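Theorem 6.4. Assume that for any $z \in \mathcal{Z}$ and $g \in \mathcal{G}$, we have $a \le
L(z, g) \le b$ for some reals $a, b$. Consider the SeqRand algorithm applied with
$\delta_\lambda(Z, g, g') = \lambda(b - a)^2/8$ and $\hat\pi(\rho) = \rho$. Its expected risk $\mathbb{E}_{Z_1^n} \mathbb{E}_{g' \sim \hat\mu} R(g')$, where
we recall that $\hat\mu$ denotes the randomizing distribution, satisfies

(6.4) $$\mathbb{E}_{Z_1^n} \mathbb{E}_{g' \sim \hat\mu} R(g') \le \min_{\rho \in \mathcal{M}} \Bigl\{ \mathbb{E}_{g \sim \rho} R(g) + \frac{\lambda (b-a)^2}{8} + \frac{K(\rho, \pi)}{\lambda(n+1)} \Bigr\}.$$

In particular, when $\mathcal{G}$ is finite, by taking $\pi$ uniform on $\mathcal{G}$ and $\lambda = \sqrt{\frac{8 \log|\mathcal{G}|}{(b-a)^2 (n+1)}}$,
we get

(6.5) $$\mathbb{E}_{Z_1^n} \mathbb{E}_{g' \sim \hat\mu} R(g') - \min_{g \in \mathcal{G}} R(g) \le (b - a) \sqrt{\frac{\log|\mathcal{G}|}{2(n+1)}}.$$
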

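Proof. From Hoeffding's inequality, we have

$$\mathbb{E}_{g' \sim \rho} \log \mathbb{E}_{g \sim \rho}\, e^{\lambda[L(Z, g') - L(Z, g)]} = \log \mathbb{E}_{g \sim \rho}\, e^{\lambda[\mathbb{E}_{g' \sim \rho} L(Z, g') - L(Z, g)]} \le \frac{\lambda^2 (b-a)^2}{8},$$

hence the variance inequality holds for $\delta_\lambda = \lambda(b - a)^2/8$ and $\hat\pi(\rho) = \rho$. The
result directly follows from Theorem 3.1. □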
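
The choice of $\lambda$ in Theorem 6.4 is the usual trade-off between a term growing linearly in $\lambda$ and a term decreasing as $1/\lambda$. The short sketch below checks this numerically. It assumes the finite-set form of (6.4) with a Dirac $\rho$ and a uniform prior, that is, an excess-risk bound $\lambda(b-a)^2/8 + \log|\mathcal{G}|/(\lambda(n+1))$; the function names are hypothetical.

```python
import math

def hoeffding_bound(lam, a, b, n, card):
    """Assumed finite-set form of (6.4): lam*(b-a)^2/8 + log(card)/(lam*(n+1))."""
    return lam * (b - a) ** 2 / 8 + math.log(card) / (lam * (n + 1))

def optimal_lambda(a, b, n, card):
    """Value of lam balancing the two terms of hoeffding_bound."""
    return math.sqrt(8 * math.log(card) / ((b - a) ** 2 * (n + 1)))

def hoeffding_rate(a, b, n, card):
    """Closed form of the optimized bound, as in (6.5)."""
    return (b - a) * math.sqrt(math.log(card) / (2 * (n + 1)))
```

Plugging `optimal_lambda` into `hoeffding_bound` recovers `hoeffding_rate`, and any other positive value of `lam` gives a larger bound, since the function $\lambda \mapsto A\lambda + B/\lambda$ is minimized at $\sqrt{B/A}$.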

The standard point of view (see Appendix A.2) applies Hoeffding's
inequality to the random variable $W = L(Z, g') - L(Z, g)$ for $g$ and $g'$ fixed
and $Z$ drawn according to the probability generating the data. The previous
theorem uses it on the random variable $W = L(Z, g') - \mathbb{E}_{g \sim \rho} L(Z, g)$ for fixed
$Z$ and fixed probability distribution $\rho$ but for $g'$ drawn according to $\rho$. Here
the gain is a multiplicative factor equal to 2 (see Appendix A.2).

6.2. Deviation inequalities. For the comparison between Theorem 6.1 
and Theorem 6.3 to be fair, one should add that (6.3) and (6.2) come from 
deviation inequalities that are not exactly obtainable to the author's knowl- 
edge with the arguments developed here. Precisely, consider the following 
adaptation of Lemma 5 of [55]. 

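Lemma 6.5. Let $\mathcal{A}$ be a learning algorithm which produces the prediction
function $\mathcal{A}(Z_1^i)$ at time $i + 1$, that is, from the data $Z_1^i = (Z_1, \dots, Z_i)$. Let
$\mathcal{C}$ be the randomized algorithm which produces a prediction function $\mathcal{C}(Z_1^n)$
drawn according to the uniform distribution on $\{\mathcal{A}(\emptyset), \mathcal{A}(Z_1), \dots, \mathcal{A}(Z_1^n)\}$.
Assume that $\sup_{z, g, g'} |L(z, g) - L(z, g')| \le B$ for some $B > 0$. Conditionally
to $Z_1, \dots, Z_{n+1}$, the expectation of the risk of $\mathcal{C}$ w.r.t. the uniform draw is
$\frac{1}{n+1} \sum_{i=0}^{n} R[\mathcal{A}(Z_1^i)]$ and satisfies: for any $\eta > 0$ and $\varepsilon > 0$, for any reference
prediction function $\tilde g$, with probability at least $1 - \varepsilon$ w.r.t. the distribution of
$Z_1, \dots, Z_{n+1}$,

(6.6) $$\begin{aligned}
\frac{1}{n+1} \sum_{i=0}^{n} R[\mathcal{A}(Z_1^i)] - R(\tilde g)
&\le \frac{1}{n+1} \sum_{i=0}^{n} \{ L[Z_{i+1}, \mathcal{A}(Z_1^i)] - L(Z_{i+1}, \tilde g) \} \\
&\quad + \eta \varphi(\eta B) \frac{1}{n+1} \sum_{i=0}^{n} V[\mathcal{A}(Z_1^i), \tilde g] + \frac{\log(\varepsilon^{-1})}{\eta(n+1)},
\end{aligned}$$

where we still use $V(g, g') = \mathbb{E}_Z\{[L(Z, g) - L(Z, g')]^2\}$ for any prediction
functions $g$ and $g'$ and $\varphi(t) = \frac{e^t - 1 - t}{t^2}$ for any $t > 0$.

Proof. See Section 10.5. □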

We see that two variance terms appear. The first one comes from the
worst-case analysis and is hidden in $\sum_{i=0}^{n} \{ L[Z_{i+1}, \mathcal{A}(Z_1^i)] - L(Z_{i+1}, \tilde g) \}$ and
the second one comes from the concentration result (Lemma 10.1). The presence
of this last variance term annihilates the benefits of our approach in
which we were manipulating variance terms much smaller than the traditional
Bernstein's variance term.

To illustrate this point, consider for instance least square regression with
bounded outputs: from Theorem 4.2 and Table 1, the hidden variance term is
null. In some situations, the second variance term $\frac{1}{n+1} \sum_{i=0}^{n} V[\mathcal{A}(Z_1^i), \tilde g]$ may
behave like a positive constant; for instance, this occurs when $\mathcal{G}$ contains
two very different functions having the optimal risk $\min_{g \in \mathcal{G}} R(g)$. By optimizing
$\eta$, this will lead to a deviation inequality of order $n^{-1/2}$ even though
from (4.4) the procedure has an $n^{-1}$ convergence rate in expectation. In [9],
Theorem 3, in a rather general learning setting, this deviation inequality of
order $n^{-1/2}$ is proved to be optimal.

To conclude, for deviation inequalities, we cannot expect to do better
than the standard-style approach since at some point we use a Bernstein's
type bound w.r.t. the distribution generating the data. Besides, procedures
based on worst-case analysis seem to suffer higher fluctuations of the risk
than necessary (see [9], discussion of Theorem 3).



Remark 6.1. Lemma 6.5 should be compared with Lemma 4.3. The 
latter deals with results in expectation while the former concerns deviation 
inequalities. Note that Lemma 6.5 requires the loss function to be bounded 
and makes a variance term appear. 



7. Application to $L_q$-regression for unbounded outputs. In this section,
we consider the $L_q$-loss: $L(Z, g) = |Y - g(X)|^q$. As a warm-up exercise, we
tackle the absolute loss setting (i.e., $q = 1$). The following corollary holds
without any assumption on the output (except, naturally, that $\mathbb{E}_Z|Y| < +\infty$
to ensure finite risk).

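Corollary 7.1. Let $q = 1$. Assume that $\sup_{g \in \mathcal{G}} \mathbb{E}_Z\, g(X)^2 \le b^2$ for some
$b > 0$. There exists an estimator $\hat g$ such that

(7.1) $$\mathbb{E}R(\hat g) - \min_{g \in \mathcal{G}} R(g) \le 2b \sqrt{\frac{2 \log|\mathcal{G}|}{n+1}}.$$

Proof. Using $\mathbb{E}_Z\{[|Y - g(X)| - |Y - g'(X)|]^2\} \le 4b^2$ and Theorem 6.1,
the algorithm considered in Theorem 6.1 satisfies $\mathbb{E}R(\hat g) - \min_{g \in \mathcal{G}} R(g) \le
2\lambda b^2 + \frac{\log|\mathcal{G}|}{\lambda(n+1)}$, which gives the desired result by taking $\lambda = \sqrt{\frac{\log|\mathcal{G}|}{2b^2(n+1)}}$. □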
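
The two steps of this proof can be made explicit. The following is a short worked derivation, given as a sketch: it only uses the reverse triangle inequality, the elementary inequality $(u - v)^2 \le 2u^2 + 2v^2$, and the minimization of $\lambda \mapsto A\lambda + B/\lambda$; the intermediate bound $2\lambda b^2 + \log|\mathcal{G}|/(\lambda(n+1))$ is taken from the proof above.

```latex
\begin{align*}
% Step 1: variance bound for the absolute loss.
\bigl[\,|Y - g(X)| - |Y - g'(X)|\,\bigr]^2
  &\le [g(X) - g'(X)]^2
   \le 2\,g(X)^2 + 2\,g'(X)^2, \\
\text{so that}\quad
\mathbb{E}_Z\bigl\{[\,|Y - g(X)| - |Y - g'(X)|\,]^2\bigr\}
  &\le 2b^2 + 2b^2 = 4b^2. \\[4pt]
% Step 2: optimization in lambda.
\min_{\lambda > 0}\Bigl\{ 2\lambda b^2 + \frac{\log|\mathcal{G}|}{\lambda(n+1)} \Bigr\}
  &= 2\sqrt{2b^2 \cdot \frac{\log|\mathcal{G}|}{n+1}}
   = 2b\sqrt{\frac{2\log|\mathcal{G}|}{n+1}},
  \qquad\text{attained at } \lambda = \sqrt{\frac{\log|\mathcal{G}|}{2b^2(n+1)}}.
\end{align*}
```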

Now we deal with the strongly convex loss functions (i.e., $q > 1$). Using
Theorem 3.1 jointly with the symmetrization idea developed in the previous
section allows us to obtain new convergence rates in heavy noise situations,
that is, when the output is not constrained to have a bounded exponential
moment.
Corollary 7.2. Let $q > 1$. Assume that

$\sup_{g \in \mathcal{G},\, x \in \mathcal{X}} |g(x)| \le b$, for some $b > 0$,

$\mathbb{E}|Y|^s \le A$, for some $s \ge q$ and $A > 0$,

$\mathcal{G}$ finite.

Let $\pi$ be the uniform distribution on $\mathcal{G}$, $C_1 > 0$ and

$$\lambda = \begin{cases} C_1 \bigl( \frac{\log|\mathcal{G}|}{n} \bigr)^{(q-1)/s}, & \text{when } q \le s \le 2q - 2, \\ C_1 \bigl( \frac{\log|\mathcal{G}|}{n} \bigr)^{q/(s+2)}, & \text{when } s \ge 2q - 2. \end{cases}$$

The expected risk of the algorithm which draws uniformly its prediction
function among $\mathbb{E}_{g \sim \pi_{-\lambda\Sigma_0}} g, \dots, \mathbb{E}_{g \sim \pi_{-\lambda\Sigma_n}} g$ is upper bounded by

$$\begin{cases} \min_{g \in \mathcal{G}} R(g) + C \bigl( \frac{\log|\mathcal{G}|}{n} \bigr)^{1 - (q-1)/s}, & \text{when } q \le s \le 2q - 2, \\ \min_{g \in \mathcal{G}} R(g) + C \bigl( \frac{\log|\mathcal{G}|}{n} \bigr)^{1 - q/(s+2)}, & \text{when } s \ge 2q - 2, \end{cases}$$

for a quantity $C$ which depends only on $C_1$, $b$, $A$, $q$ and $s$.

Proof. See Section 10.6. □

Remark 7.1. In particular, for $q = 2$, with the minimal assumption
$\mathbb{E}Y^2 \le A$ (i.e., $s = 2$), the convergence rate is of order $n^{-1/2}$, and at the
opposite, when $s$ goes to infinity, we recover the $n^{-1}$ rate we have under an
exponential moment condition on the output. Inequalities with precise constants
for least square loss can also be found in the technical report [8], Section 7.
For $q > 2$, low convergence rates (i.e., $n^{-\gamma}$ with $\gamma < 1/2$) appear when the
moment assumption is weak: $\mathbb{E}|Y|^s \le A$ for some $A > 0$ and $q \le s < 2q - 2$.
Convergence rates faster than the standard nonparametric rate $n^{-1/2}$ are
achieved for $s > 2q - 2$. Fast convergence rates systematically occur when
$1 < q < 2$ since for these values of $q$, we have $s \ge q > 2q - 2$. Surprisingly,
for $q = 1$, the picture is completely different (see Section 8.3.2 for discussion
and minimax optimality of the results of this section).

Remark 7.2. Corollary 7.2 assumes that the prediction functions in $\mathcal{G}$
are uniformly bounded. It is an open problem to have the same kind of
results under weaker assumptions such as a finite moment condition similar
to the one used in Corollary 7.1.
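
The two regimes of Corollary 7.2 are easy to tabulate. The helper below is a small sketch, with a hypothetical name, that returns the exponent $r$ such that the excess risk is of order $(\log|\mathcal{G}|/n)^r$; it assumes the exponents are $1 - (q-1)/s$ for $q \le s \le 2q-2$ and $1 - q/(s+2)$ for $s \ge 2q-2$, as read from the corollary.

```python
def rate_exponent(q, s):
    """Exponent r such that the excess risk is of order (log|G| / n) ** r.

    Assumes the two regimes of Corollary 7.2:
      r = 1 - (q - 1) / s   when q <= s <= 2q - 2,
      r = 1 - q / (s + 2)   when s >= 2q - 2.
    """
    if q <= 1:
        raise ValueError("Corollary 7.2 requires q > 1")
    if s < q:
        raise ValueError("the moment order s must satisfy s >= q")
    if s <= 2 * q - 2:
        return 1 - (q - 1) / s
    return 1 - q / (s + 2)
```

For instance, `rate_exponent(2, 2)` gives 0.5 (least squares with only a second moment), the two formulas agree at the boundary $s = 2q - 2$, and the exponent tends to 1 as $s$ grows, in line with Remark 7.1.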



8. Lower bounds. The simplest way to assess the quality of an algorithm
and of its expected risk upper bound is to prove a risk lower bound saying
that no algorithm has a better convergence rate. This section provides this
kind of assertion. The lower bounds developed here have the same spirit as
the ones in [3, 14, 18], ([31], Chapter 15) and ([6], Section 5) to the extent
that they rely on the following ideas:

• The supremum of a quantity $Q(P)$ when the distribution $P$ belongs to
some set $\mathcal{P}$ is larger than the supremum over a well-chosen finite subset of
$\mathcal{P}$, and consequently is larger than the mean of $Q(P)$ when the distribution
$P$ is drawn uniformly in the finite subset.

• When the chosen subset is a hypercube of $2^m$ distributions (see Section
8.1), the design of a lower bound over the $2^m$ distributions reduces to the
design of a lower bound over two distributions.

• When a data sequence $Z_1, \dots, Z_n$ has similar likelihoods according to two
different probability distributions, then no estimator will be accurate for
both distributions: the maximum over the two distributions of the risk
of any estimator trained on this sequence will be all the larger as the
Bayes-optimal predictions associated with the two distributions are "far
away."

We refer the reader to [15] and [46], Chapter 2, for lower bounds not
particularly based on finding the appropriate hypercube. Our analysis focuses
on hypercubes since in several settings they allow us to obtain lower bounds
with both the right convergence rate and close to optimal constants. Our
contribution in this section is:

• to provide results for general nonregularized loss functions (we recall that
nonregularized loss functions are loss functions which can be written as
$L[(x, y), g] = \ell[y, g(x)]$ for some function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$),

• to improve the upper bound on the variational distance appearing in
Assouad's argument,

• to generalize the argument to asymmetrical hypercubes which, to our
knowledge, is the only way to find the lower bound matching the upper
bound of Corollary 7.2 for $q \le s \le 2q - 2$,

• to express the lower bounds in terms of similarity measures between two
distributions characterizing the hypercube,

• to obtain lower bounds matching the upper bounds obtained in the
previous sections.

Remark 8.1. In [33], the optimality of the constant in front of
$(\log|\mathcal{G}|)/n$ has been proved by considering the situation when both $|\mathcal{G}|$ and
$n$ go to infinity. Note that this worst-case analysis constant is not necessarily
the same as our batch setting constant. This section shows that the batch
setting constant is not "far" from the worst-case analysis constant.
Besides, Lemma 4.3, which can be used to convert any worst-case analysis
upper bound into a risk upper bound in our batch setting, also means
that any lower bound for our batch setting leads to a lower bound in the
sequential prediction setting (the converse is not true). Indeed the cumulative
loss on the worst sequence of data is bigger than the average cumulative
loss when the data are taken i.i.d. from some probability distribution.
As a consequence, the bounds developed in this section partially solve the
open problem introduced in [33], Section 3.4, consisting in developing tight
nonasymptotic lower bounds. For least square loss and entropy loss, our
bounds are off by a multiplicative factor smaller than 4 (see Remarks 8.5
and 8.4).

This section is organized as follows. Section 8.1 defines the quantities that
characterize hypercubes of probability distributions and details the links
between them. It also introduces a similarity measure between probability
distributions coming from $f$-divergences (see [28]). We give our main lower
bounds in Section 8.2. These bounds are illustrated in Section 8.3.

8.1. Hypercube of probability distributions and $f$-similarities.

Definition 8.1. Let $m \in \mathbb{N}^*$. A hypercube of probability distributions
is a family of $2^m$ probability distributions on $\mathcal{Z}$

$$\{ P_\sigma : \sigma \triangleq (\sigma_1, \dots, \sigma_m) \in \{-;+\}^m \}$$

having the same first marginal, denoted $\mu$,

$$P_\sigma(dX) = P_{(+,\dots,+)}(dX) \triangleq \mu(dX) \quad \text{for any } \sigma \in \{-;+\}^m,$$

and such that there exist:

• a partition $\mathcal{X}_0, \dots, \mathcal{X}_m$ of $\mathcal{X}$ with $\mu(\mathcal{X}_1) = \cdots = \mu(\mathcal{X}_m)$,

• $h_1 \ne h_2$ in $\mathcal{Y}$,

• $0 < p_- < p_+ < 1$,

for which for any $j \in \{1, \dots, m\}$, for any $x \in \mathcal{X}_j$, we have

(8.1) $$P_\sigma(Y = h_1 | X = x) = p_{\sigma_j} = 1 - P_\sigma(Y = h_2 | X = x),$$

and for any $x \in \mathcal{X}_0$, the distribution of $Y$ knowing $X = x$ is independent of
$\sigma$ (i.e., the $2^m$ conditional distributions are identical).

In particular, (8.1) means that for any $x \in \mathcal{X} - \mathcal{X}_0$, the conditional
probability of the output knowing the input $x$ is concentrated on two values, and
that, under the distribution $P_\sigma$, the disproportion between the probabilities
of these two values is all the larger as $p_{\sigma_j}$ is far from $1/2$ for $j$ the integer
such that $x \in \mathcal{X}_j$.

An example of a hypercube is illustrated in Figure 3. 
Fig. 3. Representation of a probability distribution of the hypercube. Here the hypercube
is symmetrical ($p_- = 1 - p_+$) with $m = 8$ and the probability distribution is characterized
by $\sigma = (+,-,+,-,-,+,+,-)$.

Remark 8.2. The use of hypercubes in which $p_+$ and $p_-$ are functions
from $\mathcal{X} - \mathcal{X}_0$ to $[0;1]$ and not just constants can be required when smoothness
assumptions are put on the regression function $\eta: x \mapsto P(Y = 1 | X = x)$.
This is typically the case in works on plug-in classifiers [2, 10]. For general
hypercubes handling these kinds of constraints, we refer the reader to [8],
Section 8.1.

Let $h_1$ and $h_2$ be distinct output values. For any $p \in [0;1]$ and $y \in \mathcal{Y}$,
consider

(8.2) $$\varphi_p(y) \triangleq p\,\ell(h_1, y) + (1 - p)\,\ell(h_2, y).$$

This is the risk of the prediction function identically equal to $y$ when the
distribution generating the data satisfies $P[Y = h_1] = p = 1 - P[Y = h_2]$.
Through this distribution, the quantity

(8.3) $$\phi(p) \triangleq \inf_{y \in \mathcal{Y}} \varphi_p(y)$$

can be viewed as the risk of the best constant prediction function.
For any $q_+$ and $q_-$ in $[0;1]$, introduce

(8.4) $$\psi_{q_+,q_-}(\alpha) \triangleq \phi[\alpha q_+ + (1 - \alpha) q_-] - \alpha \phi(q_+) - (1 - \alpha) \phi(q_-).$$

Definition 8.2. Let $\{P_\sigma : \sigma \triangleq (\sigma_1, \dots, \sigma_m) \in \{-;+\}^m\}$ be a hypercube
of distributions.

1. The positive integer $m$ is called the dimension of the hypercube.

2. The probability $w \triangleq \mu(\mathcal{X}_1) = \cdots = \mu(\mathcal{X}_m)$ is called the edge probability.

3. The characteristic function of the hypercube is the function $\tilde\psi: \mathbb{R}_+ \to \mathbb{R}_+$
defined as

(8.5) $$\tilde\psi(u) \triangleq \frac{mw}{2} (u + 1)\, \psi_{p_+,p_-}\Bigl( \frac{u}{u+1} \Bigr).$$

4. The edge discrepancy of type I of the hypercube is

(8.6) $$d_I \triangleq \frac{\tilde\psi(1)}{mw} = \psi_{p_+,p_-}(1/2).$$

5. The edge discrepancy of type II of the hypercube is defined as

(8.7) $$d_{II} \triangleq \bigl( \sqrt{p_+ (1 - p_-)} - \sqrt{(1 - p_+) p_-} \bigr)^2.$$

6. A probability distribution $P_0$ on $\mathcal{Z}$ satisfying $P_0(dX) = \mu(dX)$ and for
any $x \in \mathcal{X} - \mathcal{X}_0$, $P_0[Y = h_1 | X = x] = \frac{1}{2} = P_0[Y = h_2 | X = x]$ will be
referred to as a base of the hypercube.

7. Let $P_0$ be a base of the hypercube. Consider distributions $P_{[\sigma]}$, $\sigma \in \{-,+\}$,
admitting the following density w.r.t. $P_0$:

$$\frac{P_{[\sigma]}}{P_0}(x, y) = \begin{cases} 2 p_\sigma, & \text{when } x \in \mathcal{X}_1 \text{ and } y = h_1, \\ 2[1 - p_\sigma], & \text{when } x \in \mathcal{X}_1 \text{ and } y = h_2, \\ 1, & \text{otherwise.} \end{cases}$$

The distributions $P_{[-]}$ and $P_{[+]}$ will be referred to as the representatives
of the hypercube.

8. When the functions $p_+$ and $p_-$ satisfy $p_+ = 1 - p_-$ on $\mathcal{X} - \mathcal{X}_0$, the
hypercube will be said to be symmetrical. In this case, the function $2p_+ - 1$
will be denoted $\xi$ so that

(8.8) $$p_+ = \frac{1 + \xi}{2}, \qquad p_- = \frac{1 - \xi}{2}.$$

Otherwise it will be said to be asymmetrical.

9. A $(\tilde m, \tilde w, \tilde d_{II})$-hypercube is a constant and symmetrical $\tilde m$-dimensional
hypercube with edge probability $\tilde w$ and edge discrepancy of type II equal
to $\tilde d_{II}$.
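
The edge discrepancies are simple to evaluate once $\phi$ is known in closed form. The sketch below is illustrative only, with hypothetical function names. It assumes two standard closed forms for $\phi(p) = \inf_y \varphi_p(y)$: for the absolute loss, $\phi(p) = \min(p, 1-p)\,|h_1 - h_2|$, and for the square loss, $\phi(p) = p(1-p)(h_1 - h_2)^2$ (both assume that the output space contains the segment between $h_1$ and $h_2$).

```python
import math

def phi_absolute(p, h1, h2):
    """phi(p) = inf_y [p|h1 - y| + (1 - p)|h2 - y|] for the absolute loss."""
    return min(p, 1 - p) * abs(h1 - h2)

def phi_square(p, h1, h2):
    """phi(p) = inf_y [p(h1 - y)^2 + (1 - p)(h2 - y)^2] for the square loss."""
    return p * (1 - p) * (h1 - h2) ** 2

def edge_discrepancy_I(phi, p_plus, p_minus, h1, h2):
    """d_I = psi_{p+,p-}(1/2), with psi as in (8.4)."""
    mid = 0.5 * (p_plus + p_minus)
    return phi(mid, h1, h2) - 0.5 * phi(p_plus, h1, h2) - 0.5 * phi(p_minus, h1, h2)

def edge_discrepancy_II(p_plus, p_minus):
    """d_II as in (8.7)."""
    return (math.sqrt(p_plus * (1 - p_minus)) - math.sqrt((1 - p_plus) * p_minus)) ** 2
```

For a symmetrical hypercube with $h_1 = -b$ and $h_2 = b$, this gives $d_I = b\sqrt{d_{II}}$ for the absolute loss and $d_I = b^2 d_{II}$ for the square loss, which illustrates the difference between losses that are "not enough convex" and losses that are sufficiently convex.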

Let us now give some properties of the quantities that have just been
defined. The function $\phi$ is concave since it is the infimum of concave (affine)
functions. Consequently, $\psi_{q_+,q_-}$ is concave and nonnegative on $[0;1]$. Therefore
$\tilde\psi$ is concave and nonnegative on $\mathbb{R}_+$ with $\tilde\psi(0) = 0$, hence $\tilde\psi$ is
nondecreasing and satisfies

(8.9) $$\tilde\psi(u) \ge (u \wedge 1)\,\tilde\psi(1) = m w d_I (u \wedge 1).$$

The edge discrepancies are both nonnegative quantities that are all the
smaller as $p_-$ and $p_+$ become closer. When the function $\phi$ is twice differentiable
on $]0;1[$, the edge discrepancy $d_I$ can be written through

(8.10) $$d_I = \psi_{p_+,p_-}(1/2) = \frac{(p_+ - p_-)^2}{2} \int_0^1 [t \wedge (1 - t)]\, \bigl| \phi''(t p_+ + (1 - t) p_-) \bigr|\, dt,$$

which is proved by integration by parts.

For a $(\tilde m, \tilde w, \tilde d_{II})$-hypercube, we have $m = \tilde m$, $w = \tilde w$, $d_{II} = \tilde d_{II}$, $\xi = \sqrt{d_{II}}$,
$p_- = (1 - \sqrt{d_{II}})/2$ and $p_+ = (1 + \sqrt{d_{II}})/2$. So when $\phi$ is twice differentiable
on $]p_-;p_+[$,

(8.11) $$d_I = \frac{d_{II}}{2} \int_0^1 [t \wedge (1 - t)]\, \Bigl| \phi''\Bigl( \frac{1 - \sqrt{d_{II}}}{2} + \sqrt{d_{II}}\, t \Bigr) \Bigr|\, dt.$$


Definition 8.3. When a probability distribution $P$ is absolutely continuous
w.r.t. another probability distribution $Q$, that is, $P \ll Q$, $\frac{P}{Q}$ denotes
the density of $P$ w.r.t. $Q$. Let $\mathbb{R}_+ = [0; +\infty[$. For any concave function
$f: \mathbb{R}_+ \to \mathbb{R}_+$, we define the $f$-similarity between two probability distributions
as

(8.12) $$S_f(P, Q) = \begin{cases} \int f\bigl( \frac{P}{Q} \bigr)\, dQ, & \text{if } P \ll Q, \\ f(0), & \text{otherwise.} \end{cases}$$

We call it $f$-similarity in reference to $f$-divergence (see [28]) to which it is
closely related. Here we use $f$-similarities since they are the quantities that
naturally appear in our lower bounds.

8.2. Generalized Assouad's lemma. We recall that the $n$-fold product
of a distribution $P$ is denoted $P^{\otimes n}$. We start this section with a general
lower bound for hypercubes of distributions. This lower bound is expressed
in terms of a similarity between $n$-fold products of representatives of the
hypercube.

Theorem 8.1. Let $\mathcal{P}$ be a set of probability distributions containing a
hypercube of distributions of characteristic function $\tilde\psi$ and representatives
$P_{[-]}$ and $P_{[+]}$. For any training set size $n \in \mathbb{N}^*$ and any estimator $\hat g$, we
have

(8.13) $$\sup_{P \in \mathcal{P}} \bigl\{ \mathbb{E}R(\hat g) - \min_g R(g) \bigr\} \ge S_{\tilde\psi}\bigl( P_{[+]}^{\otimes n}, P_{[-]}^{\otimes n} \bigr),$$

where the minimum is taken over the space of all prediction functions and
$\mathbb{E}R(\hat g)$ denotes the expected risk of the estimator $\hat g$ trained on a sample of
size $n$: $\mathbb{E}R(\hat g) = \mathbb{E}_{Z_1^n \sim P^{\otimes n}} R(\hat g_{Z_1^n}) = \mathbb{E}_{Z_1^n \sim P^{\otimes n}} \mathbb{E}_{(X,Y) \sim P}\, \ell[Y, \hat g_{Z_1^n}(X)]$.

Proof. See Section 10.7. □

This theorem provides a lower bound holding for any estimator and
expressed in terms of the hypercube structure. To obtain a tight lower bound
associated with a particular learning task, it then suffices to find the hypercube
in $\mathcal{P}$ for which the r.h.s. of (8.13) is the largest possible. By providing
lower bounds of $S_{\tilde\psi}(P_{[+]}^{\otimes n}, P_{[-]}^{\otimes n})$ that are more explicit w.r.t. the hypercube
parameters, we obtain the following results, which are in a more ready-to-use
form than Theorem 8.1.

Theorem 8.2. Let $\mathcal{P}$ be a set of probability distributions containing a
hypercube of distributions characterized by its dimension $m$, its edge probability
$w$ and its edge discrepancies $d_I$ and $d_{II}$ (see Definition 8.2). For any
estimator $\hat g$ and training set size $n \in \mathbb{N}^*$, the following assertions hold:

1. We have

(8.14) $$\sup_{P \in \mathcal{P}} \bigl\{ \mathbb{E}R(\hat g) - \min_g R(g) \bigr\} \ge m w d_I \bigl( 1 - \sqrt{1 - [1 - d_{II}]^{nw}} \bigr) \ge m w d_I \bigl( 1 - \sqrt{n w d_{II}} \bigr).$$

2. When the hypercube satisfies $p_+ = 1 = 1 - p_-$, we also have

(8.15) $$\sup_{P \in \mathcal{P}} \bigl\{ \mathbb{E}R(\hat g) - \min_g R(g) \bigr\} \ge m w d_I (1 - w)^n.$$

Proof. See Section 10.8. □

The lower bound (8.15) is less general than (8.14) but provides results
with tight constants when a convergence rate of order $n^{-1}$ has to be proven
(see Remarks 8.5 and 8.4).

Remark 8.3. The previous lower bounds consider deterministic estimators
(or algorithms), that is, functions from the training set space $\bigcup_{n \ge 0} \mathcal{Z}^n$
to the prediction function space $\mathcal{G}$. They still hold for randomized estimators,
that is, functions from the training set space to the set $\mathcal{D}$ of probability
distributions on $\mathcal{G}$.

8.3. Examples. Theorem 8.2 motivates the following simple strategy to
obtain a lower bound for a given set $\mathcal{P}$ of probability distributions and a
reference set $\mathcal{G}$ of prediction functions: it consists in looking for the hypercube
contained in the set $\mathcal{P}$ and for which:

• the lower bound is maximized,

• for any distribution of the hypercube, $\mathcal{G}$ contains a best prediction
function, that is, $\min_g R(g) = \min_{g \in \mathcal{G}} R(g)$.

Fig. 4. Influence of the convexity of the loss on the optimal convergence rate. Let $c > 0$.
We consider $L_q$-losses with $q = 1 + c \bigl( \frac{\log|\mathcal{G}|}{n} \bigr)^u$ for $u \ge 0$. For such values of $q$, the optimal
convergence rate of the associated learning task is of order $\bigl( \frac{\log|\mathcal{G}|}{n} \bigr)^v$ with $1/2 \le v \le 1$. This
figure represents the value of $u$ in abscissa and the value of $v$ in ordinate. The value $u = 0$
corresponds to constant $q$ greater than 1. For these $q$, the optimal convergence rate is of
order $n^{-1}$, while for $q = 1$ or "very close" to 1, the convergence rate is of order $n^{-1/2}$.

In general, the order of the bound is given by the quantity $m w d_I$ and the
quantities $w$ and $d_{II}$ are taken such that $n w d_{II}$ is of order 1. This section
illustrates this strategy by:

• providing learning lower bounds matching up to multiplicative constants 
the upper bounds developed in the previous sections, 

• significantly improving the constants in classification lower bounds for 
Vapnik-Cervonenkis classes, 

• showing that there is no uniform universal consistency for general loss 
functions. 
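
To get a feel for how the hypercube parameters enter the lower bounds, the expressions of Theorem 8.2 can be evaluated numerically. The sketch below is a hedged illustration with hypothetical function names; it assumes that (8.14) reads $m w d_I (1 - \sqrt{1 - [1 - d_{II}]^{nw}}) \ge m w d_I (1 - \sqrt{n w d_{II}})$ and that (8.15) reads $m w d_I (1 - w)^n$.

```python
import math

def assouad_lower_bound(m, w, d_I, d_II, n):
    """Assumed first expression of (8.14): m w d_I (1 - sqrt(1 - (1 - d_II)^(n w)))."""
    return m * w * d_I * (1 - math.sqrt(1 - (1 - d_II) ** (n * w)))

def assouad_lower_bound_simple(m, w, d_I, d_II, n):
    """Assumed second expression of (8.14): m w d_I (1 - sqrt(n w d_II))."""
    return m * w * d_I * (1 - math.sqrt(n * w * d_II))

def zero_noise_lower_bound(m, w, d_I, n):
    """Assumed form of (8.15), for hypercubes with p_+ = 1 and p_- = 0."""
    return m * w * d_I * (1 - w) ** n
```

With $w = 1/m$ the prefactor $m w d_I$ reduces to $d_I$, and choosing $d_{II}$ so that $n w d_{II}$ stays below 1 keeps the simpler expression positive; when $n w d_{II} \ge 1$ that expression is vacuous and only the first one remains informative.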

8.3.1. $L_q$-regression with bounded outputs. We consider $\mathcal{Y} = [-B; B]$ and
$\ell(y, y') = |y - y'|^q$, $q \ge 1$. The learning task is to predict as well as the best
prediction function in a finite set $\mathcal{G}$ of cardinal denoted $|\mathcal{G}|$. The results
of this section are roughly summed up in Figure 4, which represents the
minimax optimal convergence rate for $L_q$-regression.

• Case I < q < 1 + \J ^'°^^J^ — A 1. From (6.5), there exists an estimator g 
such that

(8.16) $$\mathbb{E}R(\hat g) - \min_{g \in \mathcal{G}} R(g) \le 2^{(2q-1)/2} B^q \sqrt{\frac{\log|\mathcal{G}|}{n}}.$$

The following corollary of Theorem 8.2 shows that this result is tight.

Theorem 8.3. Let $B > 0$ and $d \in \mathbb{N}^*$. For any training set size $n \in \mathbb{N}^*$
and any input space $\mathcal{X}$ containing at least $\lfloor \log_2 d \rfloor$ points, there exists a set
$\mathcal{G}$ of $d$ prediction functions such that: for any estimator $\hat g$ there exists a
probability distribution on the data space $\mathcal{X} \times [-B; B]$ for which



W.R{g)-mS.nR{g) > { 



2cqB'^, otherwise, 



where 

1 



Proof. See Section 10.9. □ 



Case g > 1 + Y ^'°^^J^^-^ A 1. We have seen in Section 4 that there exists 
an estimator g such that 

(8.17) ERig) - mm/?(,) < iii^^!^(log2)^°^^ 



The following corollary of Theorem 8.2 shows that this result is tight. 



Theorem 8.4. Let $B > 0$ and $d \in \mathbb{N}^*$. For any training set size $n \in \mathbb{N}^*$
and input space $\mathcal{X}$ containing at least $\lfloor \log_2(2d) \rfloor$ points, there exists a set
$\mathcal{G}$ of $d$ prediction functions such that: for any estimator $\hat g$ there exists a
probability distribution on the data space $\mathcal{X} \times [-B; B]$ for which

KR{g) - min R{g) > ( / , V e-^B" ( ii^^lM a l) . 
Proof. See Section 10.9. □ 



Remark 8.4. For least square regression (i.e., $q = 2$), Remark 8.5 holds
provided that the multiplicative factor becomes $2e \log 2 \approx 3.77$. More generally,
the method used here gives close to optimal constants but not the exact
ones. We believe that this limit is due to the use of the hypercube structure.
Indeed, the reader may check that for hypercubes of distributions, the upper
bounds used in this section are not constant-optimal since the simplifying
step consisting in using $\min_{\rho \in \mathcal{M}} \cdots \le \min_{g \in \mathcal{G}} \cdots$ is loose.
The above analysis for $L_q$-losses can be generalized to show that there
are essentially two classes of bounded losses: the ones which are not convex
or not enough convex (typical examples are the classification loss, the hinge
loss and the absolute loss) and the ones which are sufficiently convex (typical
examples are the least square loss, the entropy loss, the logit loss and the
exponential loss). For the first class of losses, the edge discrepancy of type I
is proportional to $\sqrt{d_{II}}$ for constant and symmetrical hypercubes and (8.14)
leads to a convergence rate of $\sqrt{(\log|\mathcal G|)/n}$. For the second class, the
convergence rate is $(\log|\mathcal G|)/n$ and the lower bound can be explained by the fact
that, when two prediction functions are different on a set with low probability
(typically $n^{-1}$), it often happens that the training data have no input
points in this set. For such training data, it is impossible to consistently
choose the right prediction function.

This picture of convergence rates for finite models is rather well known, 
since: 

• similar bounds (with looser constants) were known before for some cases 
(e.g., in classification; see [30, 50]), 

• mutatis mutandis, the picture exactly matches the picture in the individual
sequence prediction literature: for mixable loss functions (similar
to "sufficiently convex"), the minimax regret is $O((\log|\mathcal G|)/n)$, whereas for
0/1-type loss functions, it is $O(\sqrt{(\log|\mathcal G|)/n})$ (see, e.g., [33]).

8.3.2. $L_q$-regression for unbounded outputs having finite moments. The framework
is similar to the one of Section 8.3.1 except that "$|Y| \le B$ for some
$B > 0$" is replaced with "$\mathbb E|Y|^s \le A$ for some $s \ge q$ and $A > 0$."

Case $q = 1$. From (7.1), when $\sup_{g\in\mathcal G}\mathbb E g(X)^2 \le b^2$ for some $b > 0$, there
exists an estimator for which

$\mathbb E R(\hat g) - \min_{g\in\mathcal G} R(g) \le 2b\sqrt{(2\log|\mathcal G|)/n}.$

The following corollary of Theorem 8.2 shows that this result is tight.

Theorem 8.5. For any training set size $n \in \mathbb N^*$, positive integer $d$, positive
real number $b$ and input space $\mathcal X$ containing at least $\lfloor\log_2 d\rfloor$ points,
there exists a set $\mathcal G$ of $d$ prediction functions uniformly bounded by $b$ such
that: for any estimator $\hat g$ there exists a probability distribution for which
$\mathbb E|Y| < +\infty$ and



Proof. Let $m = \lfloor\log_2|\mathcal G|\rfloor$. We consider a $(m, 1/m, \frac{m}{4n}\wedge 1)$-hypercube
with $h_1 = -b$ and $h_2 = b$. One may check that $d_I = b\sqrt{d_{II}}$ so that (8.14)
gives that for any estimator there exists a probability distribution for which
$\mathbb E|Y| < +\infty$ and

$\mathbb E R(\hat g) - \min_{g\in\mathcal G} R(g) \ge b\sqrt{\frac{m}{4n}\wedge 1}\,\Bigl(1 - \sqrt{\frac14\wedge\frac{n}{m}}\Bigr),$

hence the desired result. □

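Case $q > 1$. First let us recall the upper bound. In Corollary 7.2, under
the assumptions

$\sup_{g\in\mathcal G,\,x\in\mathcal X}|g(x)| \le b$, for some $b > 0$,

$\mathbb E|Y|^s \le A$, for some $s \ge q$ and $A > 0$,

$\mathcal G$ finite,

we have proposed an algorithm satisfying

$\mathbb E R(\hat g) - \min_{g\in\mathcal G} R(g) \le \begin{cases} C\bigl(\frac{\log|\mathcal G|}{n}\bigr)^{1-(q-1)/s}, & \text{when } q \le s \le 2q-2,\\ C\bigl(\frac{\log|\mathcal G|}{n}\bigr)^{1-q/(s+2)}, & \text{when } s \ge 2q-2,\end{cases}$

for a quantity $C$ which depends only on $b$, $A$, $q$ and $s$.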

The following corollary of Theorem 8.2 shows that this result is tight and 
is illustrated by Figure 5. 

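Theorem 8.6. Let $d \in \mathbb N^*$, $s \ge q > 1$, $b > 0$ and $A > 0$. For any training
set size $n \in \mathbb N^*$ and input space $\mathcal X$ containing at least $\lfloor\log_2(2d)\rfloor$ points, there
exists a set $\mathcal G$ of $d$ prediction functions uniformly bounded by $b$ such that:
for any estimator $\hat g$ there exists a probability distribution on the data space
$\mathcal X \times \mathbb R$ for which $\mathbb E|Y|^s \le A$ and

$\mathbb E R(\hat g) - \min_{g\in\mathcal G} R(g) \ge \begin{cases} C\bigl(\frac{\lfloor\log_2|\mathcal G|\rfloor}{n}\wedge 1\bigr)^{1-(q-1)/s},\\ C\bigl(\frac{\lfloor\log_2|\mathcal G|\rfloor}{n}\wedge 1\bigr)^{1-q/(s+2)},\end{cases}$

for a quantity $C$ which depends only on the real numbers $b$, $A$, $q$ and $s$.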

Both inequalities simultaneously hold but the first one is tight for $q \le s \le
2q-2$ while the second one is tight for $s \ge 2q-2$. They are both based on
(8.14) applied to a $\lfloor\log_2|\mathcal G|\rfloor$-dimensional hypercube. Contrary to other lower
bounds obtained in this work, the first inequality is based on asymmetrical
hypercubes. The use of these kinds of hypercubes can be partially explained
by the fact that the learning task is asymmetrical. Indeed all values of the
output space do not have the same status since predictions are constrained
to be in $[-b;b]$ while outputs are allowed to be in the whole real space (see
the constraints on the hypercube in the proof given in Section 10.10).
Fig. 5. Optimal convergence rates in $L_q$-regression when the output has a finite moment
of order $s$ (see Theorem 8.6). The convergence rate is of order $\bigl(\frac{\log|\mathcal G|}{n}\bigr)^v$ with $0 < v \le 1$.
The figure represents the value of $s$ in abscissa and the value of $v$ in ordinate (marked
values: $q$ and $2q-2$ on the abscissa, $1/2$ and $1/q$ on the ordinate). Two cases
have to be distinguished. For $1 < q \le 2$ (figure on the top), $v$ depends smoothly on $q$. For
$q > 2$ (figure on the bottom), two stages are observed depending on whether $s$ is larger than
$2q-2$.
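
The two regimes described in the caption can be tabulated with a short script. This is a minimal sketch, assuming the exponents $v = 1-(q-1)/s$ for $q \le s \le 2q-2$ and $v = 1-q/(s+2)$ for $s \ge 2q-2$ (the values obtained by balancing the terms of the upper bound in Section 10.6); the name `rate_exponent` is illustrative and does not come from the paper.

```python
def rate_exponent(q: float, s: float) -> float:
    """Exponent v of the rate (log|G| / n)**v, assuming s >= q > 1."""
    if not (s >= q > 1):
        raise ValueError("expected s >= q > 1")
    if s <= 2 * q - 2:
        # Heavy-tailed regime (only possible when q >= 2).
        return 1 - (q - 1) / s
    # Lighter-tailed regime.
    return 1 - q / (s + 2)


if __name__ == "__main__":
    for q, s in [(1.5, 1.5), (3.0, 3.0), (3.0, 4.0), (3.0, 10.0)]:
        print(f"q={q}, s={s}: v={rate_exponent(q, s):.4f}")
```

At $s = 2q-2$ both expressions equal $1/2$, and at $s = q$ (with $q \ge 2$) the exponent is $1/q$, which matches the two ordinate values marked on the figure.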




8.3.3. Entropy loss setting. We consider $\mathcal Y = [0;1]$ and $\ell(y,y') = K(y,y')$,
where $K(y,y')$ is the Kullback-Leibler divergence between Bernoulli distributions
with respective parameters $y$ and $y'$, that is, $K(y,y') = y\log\bigl(\frac{y}{y'}\bigr) +
(1-y)\log\bigl(\frac{1-y}{1-y'}\bigr)$. We have seen in Section 4 that there exists an estimator $\hat g$
such that

(8.18) $\mathbb E R(\hat g) - \min_{g\in\mathcal G} R(g) \le \frac{\log|\mathcal G|}{n}.$

The following consequence of (8.15) shows that this result is tight. 
Theorem 8.7. For any training set size $n \in \mathbb N^*$, positive integer $d$ and
input space $\mathcal X$ containing at least $\lfloor\log_2(2d)\rfloor$ points, there exists a set $\mathcal G$ of $d$
prediction functions such that: for any estimator $\hat g$ there exists a probability
distribution on the data space $\mathcal X \times [0;1]$ for which

$\mathbb E R(\hat g) - \min_{g\in\mathcal G} R(g) \ge e^{-1}(\log 2)\Bigl(1\wedge\frac{\lfloor\log_2|\mathcal G|\rfloor}{n+1}\Bigr).$

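Proof. We use a $(m, \frac{1}{n+1}\wedge\frac{1}{m}, 1)$-hypercube with $m = \lfloor\log_2|\mathcal G|\rfloor = \lfloor\frac{\log|\mathcal G|}{\log 2}\rfloor$,
$h_1 = 0$ and $h_2 = 1$. Let $H(y)$ denote the Shannon entropy of the Bernoulli
distribution with parameter $y$, that is,

(8.19) $H(y) = -y\log y - (1-y)\log(1-y).$

Computations lead to: for any $p \in [0;1]$,

$\phi(p) = H(ph_1 + (1-p)h_2) - pH(h_1) - (1-p)H(h_2).$

From (8.4) and Definition 8.2, we get

$d_I = \psi_{1,0,0,1}(1/2) = \phi_{0,1}(1/2) = H(1/2) = \log 2.$

From (8.15), we obtain

$\mathbb E R(\hat g) - \min_{g\in\mathcal G} R(g) \ge \Bigl(\frac{\lfloor\log_2|\mathcal G|\rfloor}{n+1}\wedge 1\Bigr)(\log 2)\Bigl(1 - \frac{1}{n+1}\wedge\frac{1}{\lfloor\log_2|\mathcal G|\rfloor}\Bigr)^n.$

Then the result follows from $[1 - 1/(n+1)]^n \searrow e^{-1}$. □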
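
The last step of this proof can be checked numerically. The sketch below assumes that the hypercube bound takes the form $m\,w\,d_I(1-w)^n$ with $d_I = \log 2$ and $w = \frac{1}{n+1}\wedge\frac1m$, and compares it with the simplified form $e^{-1}(\log 2)(1\wedge\frac{m}{n+1})$; the function names are illustrative only.

```python
import math


def entropy_lower_bound(num_functions: int, n: int) -> float:
    """m * w * d_I * (1 - w)**n with d_I = log 2 and w = min(1/(n+1), 1/m)."""
    if num_functions < 2 or n < 1:
        raise ValueError("need at least 2 functions and 1 sample")
    m = num_functions.bit_length() - 1  # exact floor(log2(num_functions))
    w = min(1.0 / (n + 1), 1.0 / m)
    return m * w * math.log(2) * (1.0 - w) ** n


def simplified_bound(num_functions: int, n: int) -> float:
    """e^{-1} * log(2) * min(1, m / (n + 1))."""
    m = num_functions.bit_length() - 1
    return math.log(2) / math.e * min(1.0, m / (n + 1))


if __name__ == "__main__":
    for d, n in [(2, 1), (1024, 10), (1024, 10_000), (2 ** 40, 5)]:
        print(d, n, entropy_lower_bound(d, n), simplified_bound(d, n))
```

In every case the first value dominates the second, because $w \le 1/(n+1)$ gives $(1-w)^n \ge [1-1/(n+1)]^n \ge e^{-1}$.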

Remark 8.5. For $|\mathcal G| < 2^{n+2}$, the lower bound matches the upper bound
(8.18) up to the multiplicative factor $e \approx 2.718$. For $|\mathcal G| \ge 2^{n+2}$, the size of
the model is too large and, without any extra assumption, no estimator can
learn from the data. To prove the result, we consider distributions for which
the output is deterministic when knowing the input. So the lower bound
does not come from noisy situations but from situations in which different
prediction functions are not separated by the data to the extent that no
input data fall into the (small) subset on which they are different.

8.3.4. Binary classification. We consider $\mathcal Y = \{0;1\}$ and $\ell(y,y') = \mathbf 1_{y\neq y'}$.
Since the work of Vapnik and Chervonenkis [50], several lower bounds have
been proposed and the most refined ones are given in [30], Chapter 14.
The following theorem provides an improvement of the constants of some of
these bounds by a factor greater than 1000.

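Theorem 8.8. Let $L \in [0;1/2]$, $n \in \mathbb N$ and $\mathcal G$ be a set of prediction functions
of VC-dimension $V \ge 2$. Consider the set $\mathcal P_L$ of probability distributions
on $\mathcal X \times \{0;1\}$ such that $\inf_{g\in\mathcal G} R(g) = L$. For any estimator $\hat g$:

• when $L = 0$, there exists $\mathbb P \in \mathcal P_0$ for which

(8.20) $\mathbb E R(\hat g) - \inf_{g\in\mathcal G} R(g) \ge \begin{cases} \frac{V-1}{2e(n+1)}, & \text{when } n \ge V-2,\\ \frac12\bigl(1-\frac1V\bigr)^n, & \text{otherwise,}\end{cases}$

• when $0 < L \le 1/2$, there exists $\mathbb P \in \mathcal P_L$ for which

(8.21) $\mathbb E R(\hat g) - \inf_{g\in\mathcal G} R(g) \ge \begin{cases} \sqrt{\frac{L(V-1)}{32n}} \vee \frac{2(V-1)}{27n}, & \text{when } \frac{(1-2L)^2 n}{V-1} \ge \frac49,\\ \frac{1-2L}{6}, & \text{otherwise,}\end{cases}$

• there exists a probability distribution for which

(8.22) $\mathbb E R(\hat g) - \inf_{g\in\mathcal G} R(g) \ge \frac18\sqrt{\frac Vn}.$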

Sketch of the proof. For $h_1 \neq h_2$, we have $\phi(p) = p \wedge (1-p)$ and
for symmetrical hypercubes $d_I = \sqrt{d_{II}}/2$. Then (8.20) comes from (8.15) and
the use of a $(V-1, 1/(n+1), 1)$-hypercube and a $(V, 1/V, 1)$-hypercube.

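To prove (8.21), from (8.14) and the use of a $\bigl(V-1, \frac{2L}{V-1}, \frac{V-1}{8nL}\bigr)$-hypercube,
a $\bigl(V-1, \frac{4}{9n(1-2L)^2}, (1-2L)^2\bigr)$-hypercube and a $(V, 1/V, (1-2L)^2)$-hypercube,
we obtain

$\mathbb E R(\hat g) - \inf_{g\in\mathcal G} R(g) \ge \begin{cases} \sqrt{\frac{L(V-1)}{32n}}, & \text{when } \frac{(1-2L)^2 n}{V-1} \ge \frac L2 \vee \frac{(1-2L)^2}{8L},\\ \frac{2(V-1)}{27n(1-2L)}, & \text{when } \frac{(1-2L)^2 n}{V-1} \ge \frac49,\\ \frac{1-2L}{2}\Bigl(1-\sqrt{\frac{(1-2L)^2 n}{V}}\Bigr), & \text{always,}\end{cases}$

which can be weakened into (8.21). Finally, (8.22) comes from the last inequality
and by choosing $L$ such that $1-2L = \frac12\sqrt{V/n}$. □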
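
The arithmetic behind the $\sqrt{V/n}$ bound can be reproduced in a few lines. This is a sketch under two assumptions: that the generic lower bound (8.14) has the form $m\,w\,d_I\bigl(1-\sqrt{n\,w\,d_{II}}\bigr)$ (the form used in Section 10.9), and that $d_I = \sqrt{d_{II}}/2$ for symmetrical hypercubes in classification. The function names are hypothetical.

```python
import math


def assouad_lower_bound(m: int, w: float, d1: float, d2: float, n: int) -> float:
    """Right-hand side m*w*d_I*(1 - sqrt(n*w*d_II)) of the (8.14)-type bound."""
    return m * w * d1 * (1.0 - math.sqrt(n * w * d2))


def classification_bound(vc_dim: int, n: int) -> float:
    """Bound from a (V, 1/V, (1-2L)^2)-hypercube with 1-2L = sqrt(V/n)/2."""
    gap = 0.5 * math.sqrt(vc_dim / n)  # this is 1 - 2L; needs V <= 4n
    if gap > 1:
        raise ValueError("need V <= 4n so that L >= 0")
    d2 = gap ** 2
    d1 = math.sqrt(d2) / 2  # symmetrical hypercube, 0/1 loss
    return assouad_lower_bound(vc_dim, 1.0 / vc_dim, d1, d2, n)


if __name__ == "__main__":
    print(classification_bound(10, 1000), math.sqrt(10 / 1000) / 8)
```

With this choice $n\,w\,d_{II} = 1/4$, so the bound equals $\frac{1-2L}{2}\cdot\frac12 = \frac18\sqrt{V/n}$.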



In an asymptotical setting, [8], Section 8.4.3, provides a refinement of 
(8.22). 
8.3.5. No uniform universal consistency for general losses. This type of
result is well known and tells that there is no guarantee of doing well on finite
samples. In a classification setting, when the input space is infinite, that is,
$|\mathcal X| = +\infty$, by using a $(\lfloor n\alpha\rfloor, 1/\lfloor n\alpha\rfloor, 1)$-hypercube with $\alpha$ tending to infinity,
one can recover that: for any training sample size $n$, "any discrimination rule
can have an arbitrarily bad probability of error for finite sample size" [29],
precisely:



where the infimum is taken over all (possibly randomized) classification
rules. For general loss functions, as soon as $|\mathcal X| = +\infty$, we can use
$(\lfloor n\alpha\rfloor, 1/\lfloor n\alpha\rfloor, 1)$-hypercubes with $\alpha$ tending to infinity and obtain



where $\psi$ is the function defined in (8.4).

9. Summary of contributions and open problems. This work has devel- 
oped minimax optimal risk bounds for the general learning task consisting 
in predicting as well as the best function in a reference set. It has proposed 
to summarize this learning problem by the variance function appearing in 
the variance inequality (Section 3). The SeqRand algorithm (Figure 1) based 
on this variance function leads to minimax optimal convergence rates in the 
model selection aggregation problem, and our analysis gives a unified
view of results coming from different communities.

In particular, results coming from the online learning literature are recov- 
ered in Section 4.1. The generalization error bounds obtained by Juditsky, 
Rigollet and Tsybakov in [34] are recovered for a slightly different algorithm 
in Section 5. 

Without any extra assumption on the learning task, we have obtained a 
Bernstein's type bound which has no known equivalent form when the loss 
function is not assumed to be bounded (Section 6.1.1). When the loss func- 
tion is bounded, the use of Hoeffding's inequality w.r.t. Gibbs distributions 
on the prediction function space instead of the distribution generating the 
data leads to an improvement by a factor 2 of the standard-style risk bound 
(Theorem 6.4). 

To prove that our bounds are minimax optimal, we have refined Assouad's 
lemma particularly by taking into account the properties of the loss function. 
Theorem 8.2 is tighter than previous versions of Assouad's lemma and easier 
to apply to a learning setting than Fano's lemma (see, e.g., [46]); besides, the 
latter leads in general to very loose constants. It improves the constants of 
lower bounds related to Vapnik-Cervonenkis classes by a factor greater than 
1000. We have also illustrated our upper and lower bounds by studying the
influence of the noise of the output and of the convexity of the loss function. 

For the $L_q$-loss with $q > 1$, new matching upper and lower bounds are
given: in the online learning framework under boundedness assumption (Corol- 
lary 4.5 and Section 8.3.1 jointly with Remark 8.1), in the batch learn- 
ing setting under boundedness assumption (Sections 4.1 and 8.3.1), in the 
batch learning setting for unbounded observations under moment assump- 
tions (Sections 7 and 8.3.2). In the latter setting, we still do assume that 
the prediction functions are bounded. It is an open problem to replace this 
boundedness assumption with a moment condition. 

Finally this work has the following limits. Most of our results concern 
expected risks and it is an open problem to provide corresponding tight 
exponential inequalities. Besides we should emphasize that our expected 
risk upper bounds hold only for our algorithm. This is quite different from 
the classical point of view that simultaneously gives upper bounds on the 
risk of any prediction function in the model. To our current knowledge, this 
classical approach has a flexibility that is not recovered in our approach. 
For instance, in several learning tasks, Dudley's chaining trick [32] is the 
only way to prove risk convergence with the optimal rate. So a natural 
question and another open problem is whether it is possible to combine the 
better variance control presented here with the chaining argument (or other 
localization argument used while exponential inequalities are available). 

10. Proofs. 

10.1. Proof of Theorem 4.4. First, by a scaling argument, it suffices to
prove the result for $a = 0$ and $b = 1$. For $\mathcal Y = [0;1]$, we modify the proof in
Appendix A of [35]. Precisely, claims 1 and 2, with the notation used there,
become:

1. If the function / is concave in a{\p;q]), then we have At{q) < Bt{p). 

2. If c > R{z,p,q) for any z G {p',q), then the function / is concave in 
a{[p;q]). 

Up to the missing a (typo), the difference is that we restrict ourselves to 
values of z in [p;q]. The proof of claim 2 has no new argument. For claim 
1, it suffices to modify the definition of ^ into j = g A G~^[i{p,xt^i)] £ 
[p;g]. Then we have L{p,x[ ^) < L{p,xt^i) and L{q,x[^) < L{p,xt^i), hence 
oeix't i) > ce{xt^i) and j{x[ ^) > j{xt^i). Now one can prove that / is decreasing 
on a{[p]q]). By using Jensen's inequality, we get 

n 

At{q) = -clog^Vt,a{xt,i) 

i=l 
> -ciogJ2vt,a{xt,i)

i=l 

n 

= -clog^t;t,i/[a(2;^ J] 



i=l 



> -clog/ 



> -clog/ 



.1=1 

n 



.1=1 



The end of the proof of claim 1 is then identical. 

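10.2. Proof of Theorem 6.1. To check that the variance inequality holds,
it suffices to prove that for any $z \in \mathcal Z$

(10.1) $\mathbb E_{g'\sim\rho}\log\mathbb E_{g\sim\rho}\,e^{\lambda[L(z,g')-L(z,g)]-\frac{\lambda^2}{2}[L(z,g')-L(z,g)]^2} \le 0.$

To shorten formulae, let $a(g',g) = \lambda[L(z,g') - L(z,g)]$. By Jensen's inequality
and the following symmetrization trick, (10.1) holds:

(10.2) $\mathbb E_{g'\sim\rho}\mathbb E_{g\sim\rho}\,e^{a(g',g)-a^2(g',g)/2} \le \tfrac12\mathbb E_{g'\sim\rho}\mathbb E_{g\sim\rho}\,e^{a(g',g)-a^2(g',g)/2} + \tfrac12\mathbb E_{g'\sim\rho}\mathbb E_{g\sim\rho}\,e^{-a(g',g)-a^2(g',g)/2}$

$\le \mathbb E_{g'\sim\rho}\mathbb E_{g\sim\rho}\cosh(a(g,g'))\,e^{-a^2(g,g')/2} \le 1,$

where in the last inequality we used the inequality $\cosh(t) \le e^{t^2/2}$ for any
$t \in \mathbb R$. The result then follows from Theorem 3.1.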
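
The symmetrization argument can be sanity-checked on a finite model. The following sketch assumes a finite set of prediction functions, represented only by their losses at a fixed point $z$, and a distribution $\rho$ given as a weight vector; it evaluates $\mathbb E_{g'\sim\rho}\log\mathbb E_{g\sim\rho}\,e^{a-a^2/2}$ with $a = \lambda[L(z,g')-L(z,g)]$. All names are illustrative.

```python
import math
import random


def variance_lhs(losses, weights, lam):
    """E_{g'~rho} log E_{g~rho} exp(a - a^2/2), a = lam*(L(z,g') - L(z,g))."""
    total = 0.0
    for loss_prime, weight_prime in zip(losses, weights):
        inner = 0.0
        for loss, weight in zip(losses, weights):
            a = lam * (loss_prime - loss)
            inner += weight * math.exp(a - 0.5 * a * a)
        total += weight_prime * math.log(inner)
    return total


def random_instance(rng, size):
    """Random losses in [-5, 5] and a random probability vector."""
    losses = [rng.uniform(-5.0, 5.0) for _ in range(size)]
    raw = [rng.random() + 1e-3 for _ in range(size)]
    norm = sum(raw)
    return losses, [r / norm for r in raw]


if __name__ == "__main__":
    rng = random.Random(0)
    losses, weights = random_instance(rng, 6)
    print(variance_lhs(losses, weights, lam=1.0))  # expected to be <= 0
```

No boundedness of the losses is needed for the sign of this quantity, which is the point of the argument.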

10.3. Proof of Corollary 6.2. To shorten the following formula, let $\mu$
denote the law of the prediction function produced by the SeqRand algorithm
(w.r.t. simultaneously the training set and the randomizing procedure).
Then (6.1) can be written as: for any $\rho \in \mathcal M$,

(10.3) $\mathbb E_{g'\sim\mu}R(g') \le \mathbb E_{g\sim\rho}R(g) + \frac\lambda2\,\mathbb E_{g\sim\rho}\mathbb E_{g'\sim\mu}V(g,g') + \frac{K(\rho,\pi)}{\lambda(n+1)}.$

Define $\tilde R(g) = R(g) - R(\tilde g)$ for any $g \in \mathcal G$. Under the generalized Mammen
and Tsybakov assumption, for any $g, g' \in \mathcal G$, we have

$\tfrac12 V(g,g') \le \mathbb E_{Z\sim P}\{[L(Z,g)-L(Z,\tilde g)]^2\} + \mathbb E_{Z\sim P}\{[L(Z,g')-L(Z,\tilde g)]^2\} \le c\tilde R^\gamma(g) + c\tilde R^\gamma(g'),$

so that (10.3) leads to

(10.4) $\mathbb E_{g'\sim\mu}[\tilde R(g') - c\lambda\tilde R^\gamma(g')] \le \mathbb E_{g\sim\rho}[\tilde R(g) + c\lambda\tilde R^\gamma(g)] + \frac{K(\rho,\pi)}{\lambda(n+1)}.$

This gives the first assertion. For the second statement, let $u = \mathbb E_{g'\sim\mu}\tilde R(g')$
and $\chi(u) = u - c\lambda u^\gamma$. By Jensen's inequality, the l.h.s. of (10.4) is lower
bounded by $\chi(u)$. By straightforward computations, for any $0 < \beta < 1$, when
$u \ge \bigl(\frac{c\lambda}{1-\beta}\bigr)^{1/(1-\gamma)}$, $\chi(u)$ is lower bounded by $\beta u$, which implies the desired
result.



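10.4. Proof of Theorem 6.3. Let us prove (6.3). Let $r(g)$ denote the empirical
risk of $g \in \mathcal G$, that is, $r(g) = \frac1n\sum_{i=1}^n L(Z_i,g)$. Let $\rho \in \mathcal M$ be some fixed
distribution on $\mathcal G$. From [5], Section 8.1, with probability at least $1-\varepsilon$ w.r.t. the
training set distribution, for any $\mu \in \mathcal M$, we have

$\mathbb E_{g'\sim\mu}R(g') - \mathbb E_{g\sim\rho}R(g) \le \mathbb E_{g'\sim\mu}r(g') - \mathbb E_{g\sim\rho}r(g) + \lambda\varphi(\lambda B)\,\mathbb E_{g'\sim\mu}\mathbb E_{g\sim\rho}V(g,g') + \frac{K(\mu,\pi)+\log(\varepsilon^{-1})}{\lambda n}.$

Since the Gibbs distribution $\pi_{-\lambda\Sigma_n}$ minimizes $\mu \mapsto \mathbb E_{g'\sim\mu}r(g') + \frac{K(\mu,\pi)}{\lambda n}$, we
have

$\mathbb E_{g'\sim\pi_{-\lambda\Sigma_n}}R(g') \le \mathbb E_{g\sim\rho}R(g) + \lambda\varphi(\lambda B)\,\mathbb E_{g'\sim\pi_{-\lambda\Sigma_n}}\mathbb E_{g\sim\rho}V(g,g') + \frac{K(\rho,\pi)+\log(\varepsilon^{-1})}{\lambda n}.$

Then we apply the following inequality:

$\mathbb EW \le \mathbb E(W\vee0) = \int_0^{+\infty}\mathbb P(W>u)\,du = \int_0^1\varepsilon^{-1}\,\mathbb P(W>\log(\varepsilon^{-1}))\,d\varepsilon$

to the random variable

$W = \lambda n\bigl[\mathbb E_{g'\sim\pi_{-\lambda\Sigma_n}}R(g') - \mathbb E_{g\sim\rho}R(g) - \lambda\varphi(\lambda B)\,\mathbb E_{g'\sim\pi_{-\lambda\Sigma_n}}\mathbb E_{g\sim\rho}V(g,g')\bigr] - K(\rho,\pi).$

We get $\mathbb EW \le 1$. At last we may choose the distribution $\rho$ minimizing the
upper bound to obtain (6.3). Similarly using [5], Section 8.3, we may prove
(6.2).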
10.5. Proof of Lemma 6.5. It suffices to apply the following adaptation
of Lemma 5 of [55] to

$\xi_i(Z_1,\dots,Z_i) = L[Z_i,\mathcal A(Z_1^{i-1})] - L(Z_i,\tilde g).$

Lemma 10.1. Let $\varphi$ still denote the positive convex increasing function
defined as $\varphi(t) = \frac{e^t-1-t}{t^2}$. Let $b$ be a real number. For $i = 1,\dots,n+1$, let
$\xi_i : \mathcal Z^i \to \mathbb R$ be a function uniformly upper bounded by $b$. For any $\eta > 0$, $\varepsilon > 0$,
with probability at least $1-\varepsilon$ w.r.t. the distribution of $Z_1,\dots,Z_{n+1}$, we have

(10.5) $\sum_{i=1}^{n+1}\xi_i(Z_1,\dots,Z_i) \le \sum_{i=1}^{n+1}\mathbb E_{Z_i}\xi_i(Z_1,\dots,Z_i) + \eta\varphi(\eta b)\sum_{i=1}^{n+1}\mathbb E_{Z_i}\xi_i^2(Z_1,\dots,Z_i) + \frac{\log(\varepsilon^{-1})}{\eta},$

where $\mathbb E_{Z_i}$ denotes the expectation w.r.t. the distribution of $Z_i$ only.

Remark 10.1. The same type of bounds without variance control can 
be found in [23]. 

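Proof of Lemma 10.1. For any $i \in \{0,\dots,n+1\}$, define

$\psi_i = \sum_{j=1}^i\xi_j - \sum_{j=1}^i\mathbb E_{Z_j}\xi_j - \eta\varphi(\eta b)\sum_{j=1}^i\mathbb E_{Z_j}\xi_j^2,$

where $\xi_j$ is the short version of $\xi_j(Z_1,\dots,Z_j)$. For any $i \in \{0,\dots,n\}$, we
trivially have

(10.6) $\psi_{i+1} - \psi_i = \xi_{i+1} - \mathbb E_{Z_{i+1}}\xi_{i+1} - \eta\varphi(\eta b)\,\mathbb E_{Z_{i+1}}\xi_{i+1}^2.$

Now for any $b \in \mathbb R$, $\eta > 0$ and any random variable $W$ such that $W \le b$ a.s.,
we have

(10.7) $\mathbb E\,e^{\eta(W-\mathbb EW-\eta\varphi(\eta b)\mathbb EW^2)} \le 1.$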

Remark 10.2. The proof of (10.7) is standard and can be found, for
example, in [4], Section 7.1.1. We use (10.7) instead of the inequality used
to prove Lemma 5 of [55], that is, $\mathbb E\,e^{\eta[W-\mathbb EW-\eta\varphi(\eta b')\mathbb E(W-\mathbb EW)^2]} \le 1$ for
$W - \mathbb EW \le b'$, since we are interested in excess risk bounds. Precisely, we will take
$W$ of the form $W = L(Z,g) - L(Z,g')$ for fixed functions $g$ and $g'$. Then we
have $W \le \sup_{z,g}L - \inf_{z,g}L$ while we only have $W - \mathbb EW \le 2(\sup_{z,g}L -
\inf_{z,g}L)$. Besides, the gain of having $\mathbb E(W-\mathbb EW)^2$ instead of $\mathbb EW^2$ is useless
in the applications we develop here.
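By combining (10.7) and (10.6), we obtain

(10.8) $\mathbb E_{Z_{i+1}}e^{\eta(\psi_{i+1}-\psi_i)} \le 1.$

By using Markov's inequality, we upper bound the following probability
w.r.t. the distribution of $Z_1,\dots,Z_{n+1}$:

$\mathbb P\Bigl(\sum_{i=1}^{n+1}\xi_i > \sum_{i=1}^{n+1}\mathbb E_{Z_i}\xi_i + \eta\varphi(\eta b)\sum_{i=1}^{n+1}\mathbb E_{Z_i}\xi_i^2 + \frac{\log(\varepsilon^{-1})}{\eta}\Bigr)$

$= \mathbb P(\eta\psi_{n+1} > \log(\varepsilon^{-1}))$

$= \mathbb P(\varepsilon e^{\eta\psi_{n+1}} > 1)$

$\le \varepsilon\,\mathbb E_{Z_1}\bigl(e^{\eta(\psi_1-\psi_0)}\mathbb E_{Z_2}\bigl(\cdots e^{\eta(\psi_n-\psi_{n-1})}\mathbb E_{Z_{n+1}}e^{\eta(\psi_{n+1}-\psi_n)}\bigr)\bigr)$

$\le \varepsilon,$

where the last inequality follows from recursive use of (10.8). □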
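
The one-step inequality that drives this proof can be checked on discrete random variables. The sketch below assumes the Bernstein-type function $\varphi(t) = (e^t-1-t)/t^2$ (extended by $\varphi(0)=1/2$) and evaluates $\mathbb E\,e^{\eta(W-\mathbb EW-\eta\varphi(\eta b)\mathbb EW^2)}$ for a finitely supported $W \le b$. Function names are illustrative.

```python
import math
import random


def phi(t: float) -> float:
    """phi(t) = (exp(t) - 1 - t) / t**2, extended by continuity at 0."""
    if abs(t) < 1e-4:
        return 0.5 + t / 6.0 + t * t / 24.0  # Taylor expansion near 0
    return (math.expm1(t) - t) / (t * t)


def bernstein_mgf(values, probs, eta, b):
    """E exp(eta * (W - E W - eta * phi(eta*b) * E W^2)) for a discrete W <= b."""
    mean = sum(p * v for v, p in zip(values, probs))
    second = sum(p * v * v for v, p in zip(values, probs))
    shift = mean + eta * phi(eta * b) * second
    return sum(p * math.exp(eta * (v - shift)) for v, p in zip(values, probs))


if __name__ == "__main__":
    rng = random.Random(0)
    values = [rng.uniform(-3.0, 3.0) for _ in range(5)]
    probs = [0.2] * 5
    print(bernstein_mgf(values, probs, eta=1.0, b=max(values)))  # at most 1
```

The check relies on $e^{x} \le 1 + x + x^2\varphi(\eta b)$ for $x = \eta W \le \eta b$, which holds because $\varphi$ is increasing.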

10.6. Proof of Corollary 7.2. We start with the following theorem con- 
cerning general loss functions. 

Theorem 10.2. Let $B \ge b > 0$ and $\mathcal Y = \mathbb R$. Consider a loss function $L$
which can be written as $L[(x,y),g] = \ell[y,g(x)]$, where the function $\ell : \mathbb R\times\mathbb R \to \mathbb R$
satisfies: there exists $\lambda_0 > 0$ such that for any $y \in [-B;B]$, the function
$y' \mapsto e^{-\lambda_0\ell(y,y')}$ is concave on $[-b;b]$. Let

$\Delta(y) = \sup_{|\alpha|\le b,\,|\beta|\le b}[\ell(y,\alpha) - \ell(y,\beta)].$

For A G (0;Ao], consider the algorithm that draws uniformly its prediction 
function in the set {Eg^T^_^^^g, . . . ,Kg^T^_^^^g} , and consider the determin- 
istic version of this randomized algorithm. The expected risk of these algo- 
rithms satisfies 

/ J n \ 



__9 

j=0 

1 



(10.9) 



i=0 

K{p,7T) 



< mini E„^gR(g) + . , 
- p&m\ ^ ^ A(n + 1) 

r AA^(i^) r 1 1 1 

+ 2 lAA(y)<l;|y|>B + ~ ^ lAA(Y)>l;|y|>B|- 
Proof. The first inequality follows from Jensen's inequality. Let us
prove the second. According to Theorem 3.1, it suffices to check that the
variance inequality holds for $0 < \lambda \le \lambda_0$, $\hat\pi(\rho)$ the Dirac distribution at $\mathbb E_{g\sim\rho}g$
and

;i-c)2AA2(y)- 



CA(y) + 



o<C<i 

AA2(y) 

2 lAA(s/)<l;|y|>B + 



'\V\>B 



LAA(y)>l;|s/|>B- 



For any $z = (x,y) \in \mathcal Z$ such that $|y| \le B$, for any probability distribution
$\rho$ and for the above values of $\lambda$ and $\delta_\lambda$, by Jensen's inequality, we have



-A^[s/,9(x)] 



<gAL(^,E^,^^3')(]E^^^e- 



~Aof[y,g(x)]\A/Ao 



^^\i[y,Eg,^,9'ix)]-Xi[y,Eg^pgix)] 
= 1, 

where the last inequality comes from the concavity of $y' \mapsto e^{-\lambda_0\ell(y,y')}$. This
concavity argument goes back to [36], Section 4, and was also used in [19]
and in some of the examples given in [34].

For any $z = (x,y) \in \mathcal Z$ such that $|y| > B$, for any $0 \le \zeta \le 1$, by using twice
Jensen's inequality and then by using the symmetrization trick presented
in Section 6, we have

,X[L{z,f:^,^^g')-Liz,g)-5x{z,g,g')] 



= ^-Sx{y)^^^^^eHL{z,Eg,^^g')-L{z,9)] 



X e 



Xmz,g'yLiz,g)]+l/2X''il^O^[L{^,9')~Liz,g)]'^ 



} 



<e~'^^(^'Eg^pEg/^p{e^(^-^)[^(^'^')-^(^'S)l-^/2A2{i-c)2[L(2,9')-i(^,9)]' 

,ACA{j/)+l/2A2(l-C)^A2(y) 



X e 



} 



< e 



ACA(s/)+(1/2)A2{1~C)'A2(s/) 



Taking $\zeta \in [0;1]$ minimizing the last r.h.s., we obtain that



]^^^^QKLiz,f^a'r^p9')~L(z,gyS\iz,g,g')] < x. 
From the two previous computations, we obtain that for any $z \in \mathcal Z$,

$\log\mathbb E_{g\sim\rho}\,e^{\lambda[L(z,\mathbb E_{g'\sim\rho}g') - L(z,g) - \delta_\lambda(z,g,g')]} \le 0,$

so that the variance inequality holds for the above values of $\lambda$, $\hat\pi(\rho)$ and
$\delta_\lambda$, and the result follows from Theorem 3.1. □

To apply Theorem 10.2, we will first determine $\lambda_0$ for which the function
$\zeta : y' \mapsto e^{-\lambda_0|y-y'|^q}$ is concave. For any given $y \in [-B;B]$, for any $q > 1$,
straightforward computations give

$\zeta''(y') = [\lambda_0 q|y'-y|^q - (q-1)]\,\lambda_0 q|y'-y|^{q-2}e^{-\lambda_0|y'-y|^q}$

for $y' \neq y$, hence $\zeta'' \le 0$ on $[-b;b] - \{y\}$ for $\lambda_0 = \frac{q-1}{q(B+b)^q}$. Now since the
derivative $\zeta'$ is defined at the point $y$, we conclude that the function $\zeta$ is
concave on $[-b;b]$, so that we may use Theorem 10.2 with $\lambda_0 = \frac{q-1}{q(B+b)^q}$.
For any $|y| \ge b$, we have

$2bq(|y|-b)^{q-1} \le \Delta(y) \le 2bq(|y|+b)^{q-1}.$

As a consequence, when $|y| \ge b + (2bq\lambda)^{-1/(q-1)}$, we have $\lambda\Delta(y) \ge 1$ and
$\Delta(y) - 1/(2\lambda)$ can be upper bounded by $C'|y|^{q-1}$, where the quantity $C'$
depends only on $b$ and $q$.

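For other values of $|y|$, that is, when $b \le |y| < b + (2bq\lambda)^{-1/(q-1)}$, we have

$\frac{\lambda\Delta^2(y)}{2}\mathbf 1_{\lambda\Delta(y)<1;|y|>B} + \Bigl[\Delta(y) - \frac{1}{2\lambda}\Bigr]\mathbf 1_{\lambda\Delta(y)\ge1;|y|>B}$

$= \min_{0\le\zeta\le1}\Bigl[\zeta\Delta(y) + \frac{(1-\zeta)^2\lambda\Delta^2(y)}{2}\Bigr]\mathbf 1_{|y|>B}$

$\le \tfrac12\lambda\Delta^2(y)\mathbf 1_{|y|>B}$

$\le 2\lambda b^2q^2(|y|+b)^{2q-2}\mathbf 1_{|y|>B}$

$\le C''\lambda|y|^{2q-2}\mathbf 1_{|y|>B},$

where $C''$ depends only on $b$ and $q$.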
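
The identity between the two-indicator expression and the minimum over $\zeta$ is easy to verify numerically. The sketch below assumes the objective $\zeta\Delta + (1-\zeta)^2\lambda\Delta^2/2$ on $[0;1]$, whose minimum is $\lambda\Delta^2/2$ when $\lambda\Delta < 1$ and $\Delta - 1/(2\lambda)$ otherwise, and compares this closed form with a brute-force grid search. Function names are illustrative.

```python
def closed_form(lam: float, delta: float) -> float:
    """min over zeta in [0,1] of zeta*delta + (1-zeta)^2 * lam * delta^2 / 2."""
    if lam * delta < 1:
        return lam * delta ** 2 / 2  # minimum reached at zeta = 0
    return delta - 1 / (2 * lam)     # minimum reached at zeta = 1 - 1/(lam*delta)


def grid_minimum(lam: float, delta: float, steps: int = 10_000) -> float:
    """Brute-force minimum over an evenly spaced grid of zeta values."""
    return min(
        (k / steps) * delta + (1 - k / steps) ** 2 * lam * delta ** 2 / 2
        for k in range(steps + 1)
    )


if __name__ == "__main__":
    for lam, delta in [(0.5, 1.0), (2.0, 3.0), (1.0, 1.0), (0.1, 4.0)]:
        print(lam, delta, closed_form(lam, delta), grid_minimum(lam, delta))
```

Evaluating the objective at $\zeta = 0$ gives the upper bound $\frac12\lambda\Delta^2$ used in the next line of the derivation.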

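Therefore, from (10.9), for any $0 < b \le B$ and $\lambda > 0$ satisfying $\lambda \le \frac{q-1}{q(B+b)^q}$,
the expected risk is upper bounded by

(10.10) $\min_{\rho\in\mathcal M}\Bigl\{\mathbb E_{g\sim\rho}R(g) + \frac{K(\rho,\pi)}{\lambda(n+1)}\Bigr\} + \mathbb E\bigl\{C'|Y|^{q-1}\mathbf 1_{|Y|\ge b+(2bq\lambda)^{-1/(q-1)};|Y|>B}\bigr\} + \mathbb E\bigl\{C''\lambda|Y|^{2q-2}\mathbf 1_{B<|Y|<b+(2bq\lambda)^{-1/(q-1)}}\bigr\}.$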

Let us take $B = \bigl(\frac{q-1}{q\lambda}\bigr)^{1/q} - b$ with $\lambda$ small enough to ensure that $b \le B \le
b + (2bq\lambda)^{-1/(q-1)}$. This means that $\lambda$ should be taken smaller than some
positive constant depending only on $b$ and $q$. Then (10.10) can be written



as 



mm 



{E,.,iZ(5) + ^^}+E{C'|yr-il|y|>,+(2,,,)_V(.-i)} 



+ E{C"A|y ^l((g-l)/(gA))l/9_6<|y|<6+(2bgA)-i/('J-i) }• 

The moment assumption on $Y$ implies

(10.11) $\alpha^{s-q}\,\mathbb E|Y|^q\mathbf 1_{|Y|\ge\alpha} \le A$ for any $0 \le q \le s$ and $\alpha \ge 0$.

So we can upper bound (10.10) with

naini?(5) + M^ + CA(''+i"'?)/(^"^) 
geg An 

+ CA(A(-2.+2)/,i^^^^_^ ^ ^(2-2g+s)(,-l)i^^^^_^)^ 

where C depends only on 6, A, q and s. So we get 
1 



E 



i=0 

< minRig) + + CX^'^'-'^^/^'^-'^ + CA(^-''+2)/n,>2,_2 
g^g An 

< minfi(g) + 1^ + CA(^+i-'')/('?-i)l.<2,.2 + CA(-^+2)/,i^^ 
gee An - 

since $\frac{s+1-q}{q-1} \ge \frac{s-q+2}{q}$ is equivalent to $s \ge 2q-2$. By taking $\lambda$ of order of the
minimum of the r.h.s. (which implies that $\lambda$ goes to 0 when $n/\log|\mathcal G|$ goes
to infinity), we obtain the desired result.

10.7. Proof of Theorem 8.1. The symbols ai, . . . , Gm still denote the co- 
ordinates of cj G { — ; -|-}"^. For any r S {— ; 0; +}, define 

^j,r — (<7i, • • ■ ,<yj-i,r,aj+i, . . . ,crm) 

as the vector deduced from a by fixing its jth coordinate to r. Since 
and (Tj- belong to {— ; -|-}"^, we have already defined P^j + and P^j _ • Now 
we define the distribution P^jo as P^^^{dX) = n{dX) and 

1 - P^^ ,{Y = h2\X) = P^^ ^^y = hi\X) 

^ , for any X ^ Xj, 

P^{Y = hi\X)., otherwise. 

The distribution P^. ^ differs from P^ only by the conditional law of the out- 
put knowing that the input is in Xj. We recall that P®" denotes the n-fold 
product of a distribution $P$. For any $r \in \{-;+\}$, introduce the likelihood



A CTj 



ratios for the data = {Zi, . . . , Zn) '■ TTrj{Z'i) = -^jf (Zf ). This quantity is 



independent of the value of a. Let u be the uniform distribution on {—,+}, 
that is, i^({+}) = 1/2 = 1 — }). In the following, denotes the expec- 
tation when a is drawn according to the m-fold product distribution of z^, 
and Ex = Ex~^ . We have 



supj E R{g) -mmR{g)\ 



> sup \E^„^p^^Kz^pJ[Y,g{X)]-mmEzr.pAY,9{X)] 



sup <{ E^n^plS7^Exr^p^(dX) 

ae{-;+}'" I 1 



EY^P,idY\xAY,9{X)] 



(10.12) 



sup <Eyn^p«>nEx 



- minEy^p_(rfy|x)^(l", y) 

m 

Li=0 



> E5-E^„^p_®nEx 



Li=l 



^Ex{lxeA'jIEai,...,<7j_i,aj+i,...,cr™^IEz"^P_®" E^^^^7r^^.j(Z" 



x(v^p.,[ffW]-0K])}. 



The two inequalities in (10.12) are Assouad's argument [3]. For any x ^ X , 
introduce aj(Zf) = ■{z")+n^ \^") ' '^^^ ^^^^ expectation in (10.12) is 

E^^,7r„,,iZ^)iippMX)]-cl>[Pa]) 
= i[^+,,(Zr)+vr„,,(Zn] 

X {a,(Zn^p+[9(^)] + [1 - ajiZ^)]^p_[giX)] 
(10.13) - a,(Zr)</>(p+) - [1 - a,(Zr)]<A(P-)} 



^k+j(^r) + ^-,i(-2T)]{V'aj(Z,")p+ + [l-aj(Zn]p_ [5(-'^)] 
-a,{Z^)<P{p^)-[l-a,{Zn<P{p.)}

> +vr_,,(Zr)]{</>(a,(ZrK + [1 - a,{Z^)]p^) 

-a,(Zr)c/.(p+)-[l-a,(Zr)]0(p_)} 



mw V7r_j(Zf) 
so that 



sup< E — mmii(5() 



Now since we consider a hypercube, for any j G {l,...,m}, all the terms 
in the sum are equal. Besides one can check that the last /-similarity does 
not depend on a, and is equal to Pj'^^where we recall that P[_|„] 

and P[_] denote the representatives of the hypercube (see Definition 8.2) 
Therefore we obtain 

sup|EP(5)-minP(5)| >5^(P[^^,PPP. 

10.8. Proof of Theorem 8.2. First, when the hypercube satisfies p+ = 
1 = 1 — p_, from the definition of di given in (8.6), we have S^{P^,P^) = 
mwdi{\ — w)"^ so that Theorem 8.1 implies (8.15). 

Inequality (8.14) is deduced from Theorem 8.1 by lower bounding the $\psi$-similarity.
Since $u \mapsto u\wedge1$ is a nonnegative concave function defined on $\mathbb R_+$,
we may define the similarity

$S_\wedge(\mathbb P,\mathbb Q) \triangleq \int\Bigl(\frac{d\mathbb P}{d\mathbb Q}\wedge1\Bigr)d\mathbb Q = \int(d\mathbb P\wedge d\mathbb Q),$

where the second equality introduces a formal (but intuitive) notation. From
Theorem 8.1, by using (8.9), we obtain
Corollary 10.3. Let $\mathcal P$ be a set of probability distributions containing
a hypercube of distributions of characteristic function $\psi$ and representatives
$P_{[+]}$ and $P_{[-]}$. For any estimator $\hat g$, we have

(10.14) $\sup_{P\in\mathcal P}\bigl\{\mathbb E R(\hat g) - \min_g R(g)\bigr\} \ge m\,w\,d_I\,S_\wedge\bigl(P_{[+]}^{\otimes n},P_{[-]}^{\otimes n}\bigr),$

where the minimum is taken over the space of prediction functions.

The following lemma and (10.14) imply (8.14).

Lemma 10.4. We have

(10.15) $S_\wedge\bigl(P_{[+]}^{\otimes n},P_{[-]}^{\otimes n}\bigr) \ge 1 - \sqrt{1-[1-d_{II}]^{nw}} \ge 1 - \sqrt{n\,w\,d_{II}}.$

Proof. See Section 10.8.1. □



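10.8.1. Proof of Lemma 10.4. For $\sigma \in \{-,+\}$, define $Q_\sigma$ as the probability
on $\{h_1,h_2\}$ such that $Q_\sigma(Y = h_1) = p_\sigma = 1 - Q_\sigma(Y = h_2)$. The following
lemma relates the $\wedge$-similarity between representatives of the hypercube and
the $\wedge$-similarity between $Q_+$ and $Q_-$.

Lemma 10.5. Consider a convex function $\gamma : \mathbb R_+ \to \mathbb R_+$ such that

$\gamma(k) \le S_\wedge\bigl(Q_+^{\otimes k},Q_-^{\otimes k}\bigr)$

for any $k \in \{0,\dots,n\}$, where by convention $S_\wedge(Q_+^{\otimes0},Q_-^{\otimes0}) = 1$. For any
estimator $\hat g$, we have

$S_\wedge\bigl(P_{[+]}^{\otimes n},P_{[-]}^{\otimes n}\bigr) \ge \gamma(nw).$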



Proof. For any points $z_1 = (x_1,y_1),\dots,z_n = (x_n,y_n)$ in $\mathcal X\times\{h_1,h_2\}$,
let $C(z_1,\dots,z_n)$ denote the number of $z_i$ for which $x_i \in \mathcal X_1$. For any $k \in
\{0,\dots,n\}$, let $B_k = C^{-1}(\{k\})$ denote the subset of $(\mathcal X\times\{h_1,h_2\})^n$ for which
exactly $k$ points are in $\mathcal X_1\times\{h_1,h_2\}$. We recall that there are $\binom nk$ possibilities
of taking $k$ elements among $n$ and the probability of $X \in \mathcal X_1$ when $X$ is drawn
according to $\mu$ is $w = \mu(\mathcal X_1)$. Let $\mathcal Z_1 = \mathcal X_1\times\{h_1,h_2\}$ and let $\mathcal Z_1^c$ denote the
complement of $\mathcal Z_1$. We have

^a[P[+] ,^[_] j 

= J lA(^^{zi,...,Zn))dP^j;{zi,...,Zn) 

(10.16) =5:/ lA -i±i(zi)...-i±i(z„) dPH(^i)---^^H(^-) 
t^o\kJ J(Zrrx{Z^,)--^ \P[-] P[-] J 'J

Ef^l / iA(5±i(,,)---5±i(.,))dpp]^(.„...,,„) 
E0/^"-'(^o 



fc=0 
n 



>E(^^j(i-^r"'^'7(^) 

where $V$ is a Binomial distribution with parameters $n$ and $w$. By Jensen's
inequality, we have $\mathbb E\gamma(V) \ge \gamma[\mathbb E(V)] = \gamma(nw)$, which ends the proof. □
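
The Jensen step can be illustrated with the specific convex function used later in this section. The sketch below assumes $\gamma(x) = 1-\sqrt{1-(1-d_{II})^x}$ and computes $\mathbb E\gamma(V)$ for $V$ binomial with parameters $n$ and $w$ by direct summation; the function names are illustrative and `math.comb` requires Python 3.8 or later.

```python
import math


def gamma(x: float, d2: float) -> float:
    """Convex function gamma(x) = 1 - sqrt(1 - (1 - d2)**x), with 0 < d2 < 1."""
    return 1.0 - math.sqrt(1.0 - (1.0 - d2) ** x)


def expected_gamma(n: int, w: float, d2: float) -> float:
    """E gamma(V) for V ~ Binomial(n, w), computed by direct summation."""
    return sum(
        math.comb(n, k) * w ** k * (1.0 - w) ** (n - k) * gamma(k, d2)
        for k in range(n + 1)
    )


if __name__ == "__main__":
    n, w, d2 = 50, 0.1, 0.05
    print(expected_gamma(n, w, d2), gamma(n * w, d2))
```

The first printed value is never smaller than the second, which is the content of the inequality $\mathbb E\gamma(V) \ge \gamma(nw)$.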

The interest of the previous lemma is to provide a lower bound on the 
similarity between representatives of the hypercube from a lower bound on 
the similarities between distributions much simpler to study. The following 
result lower bounds these similarities. 

Lemma 10.6. For any nonnegative integer $k$, we have

(10.17) $S_\wedge\bigl(Q_+^{\otimes k},Q_-^{\otimes k}\bigr) \ge 1 - \sqrt{1-[1-d_{II}]^k} \ge 1 - \sqrt{k\,d_{II}}.$

Proof. To study divergences (or equivalently, similarities) between $k$-fold
product distributions, the standard way is to link the divergence (or
similarity) of the product with the ones of base distributions. This leads to
tensorization equalities or inequalities. To obtain a tensorization inequality
for $S_\wedge$, we introduce the similarity associated with the square root function
(which is nonnegative and concave):

$S_{\sqrt{\cdot}}(\mathbb P,\mathbb Q) = \int\sqrt{d\mathbb P\,d\mathbb Q}$

and use the following lemmas:

Lemma 10.7. For any probability distributions $\mathbb P$ and $\mathbb Q$, we have

$S_\wedge(\mathbb P,\mathbb Q) \ge 1 - \sqrt{1-S_{\sqrt{\cdot}}^2(\mathbb P,\mathbb Q)}.$
Proof. Introduce the variational distance $V(\mathbb P,\mathbb Q)$ as the $f$-divergence
associated with the convex function $f : u \mapsto \frac12|u-1|$. From Scheffé's theorem,
we have $S_\wedge(\mathbb P,\mathbb Q) = 1 - V(\mathbb P,\mathbb Q)$ for any distributions $\mathbb P$ and $\mathbb Q$. Introduce
the Hellinger distance $H$, which is defined as $H(\mathbb P,\mathbb Q) \ge 0$ and $1 - \frac{H^2(\mathbb P,\mathbb Q)}{2} =
S_{\sqrt{\cdot}}(\mathbb P,\mathbb Q)$ for any probability distributions $\mathbb P$ and $\mathbb Q$. The variational and
Hellinger distances are known (see, e.g., [46], Lemma 2.2) to be related by

$V(\mathbb P,\mathbb Q) \le \sqrt{1-\Bigl(1-\frac{H^2(\mathbb P,\mathbb Q)}{2}\Bigr)^2},$

hence the result. □

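Lemma 10.8. For any distributions $\mathbb P^{(1)},\dots,\mathbb P^{(k)}$, $\mathbb Q^{(1)},\dots,\mathbb Q^{(k)}$, we have

$S_{\sqrt{\cdot}}\bigl(\mathbb P^{(1)}\otimes\cdots\otimes\mathbb P^{(k)},\mathbb Q^{(1)}\otimes\cdots\otimes\mathbb Q^{(k)}\bigr) = S_{\sqrt{\cdot}}\bigl(\mathbb P^{(1)},\mathbb Q^{(1)}\bigr)\times\cdots\times S_{\sqrt{\cdot}}\bigl(\mathbb P^{(k)},\mathbb Q^{(k)}\bigr).$

Proof. When it exists, the density of $\mathbb P^{(1)}\otimes\cdots\otimes\mathbb P^{(k)}$ w.r.t. $\mathbb Q^{(1)}\otimes\cdots\otimes\mathbb Q^{(k)}$
is the product of the densities of $\mathbb P^{(i)}$ w.r.t. $\mathbb Q^{(i)}$, $i = 1,\dots,k$, hence the
desired tensorization equality. □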

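From the last two lemmas, we obtain

$S_\wedge\bigl(Q_+^{\otimes k},Q_-^{\otimes k}\bigr) \ge 1 - \sqrt{1-S_{\sqrt{\cdot}}^{2k}(Q_+,Q_-)}.$

Now we have

$S_{\sqrt{\cdot}}^2(Q_+,Q_-) = \bigl[\sqrt{p_+p_-}+\sqrt{(1-p_+)(1-p_-)}\bigr]^2$
$= 1 - \bigl[\sqrt{p_+(1-p_-)}-\sqrt{(1-p_+)p_-}\bigr]^2$
$= 1 - d_{II}.$

So we get

(10.18) $S_\wedge\bigl(Q_+^{\otimes k},Q_-^{\otimes k}\bigr) \ge 1 - \sqrt{1-(1-d_{II})^k} \ge 1 - \sqrt{k\,d_{II}},$

where the second inequality follows from the inequality $1 - x^k \le k(1-x)$
that holds for any $0 \le x \le 1$ and $k \ge 1$. This ends the proof of (10.17). □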
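
For two-point distributions the whole chain of bounds can be evaluated exactly. The sketch below assumes $Q_\pm$ are Bernoulli-type distributions on $\{h_1,h_2\}$ with $Q_\pm(Y=h_1) = p_\pm$, computes the $\wedge$-similarity of their $k$-fold products by grouping sequences with the same number of $h_1$ outcomes, and compares it with $1-\sqrt{1-(1-d_{II})^k}$ and $1-\sqrt{k\,d_{II}}$. Function names are illustrative.

```python
import math


def min_similarity_product(p_plus: float, p_minus: float, k: int) -> float:
    """Integral of the minimum of the two k-fold product likelihoods."""
    # Sequences with the same number j of outcomes h1 share the same likelihoods.
    return sum(
        math.comb(k, j)
        * min(p_plus ** j * (1 - p_plus) ** (k - j),
              p_minus ** j * (1 - p_minus) ** (k - j))
        for j in range(k + 1)
    )


def edge_discrepancy_ii(p_plus: float, p_minus: float) -> float:
    """d_II = (sqrt(p+ (1 - p-)) - sqrt((1 - p+) p-))**2."""
    return (math.sqrt(p_plus * (1 - p_minus))
            - math.sqrt((1 - p_plus) * p_minus)) ** 2


def chain_of_bounds(p_plus: float, p_minus: float, k: int):
    """Return (exact similarity, Hellinger-based bound, linearized bound)."""
    d2 = edge_discrepancy_ii(p_plus, p_minus)
    exact = min_similarity_product(p_plus, p_minus, k)
    middle = 1 - math.sqrt(1 - (1 - d2) ** k)
    loose = 1 - math.sqrt(k * d2)
    return exact, middle, loose


if __name__ == "__main__":
    print(chain_of_bounds(0.6, 0.4, 5))
```

The three returned values are expected to be in decreasing order; the last one can be negative when $k\,d_{II} > 1$, in which case it carries no information.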

By computing the second derivative of $u \mapsto \sqrt{1-e^{-u}}$, we obtain that this
function is concave. So for any $a \in [0;1]$, the functions $x \mapsto 1-\sqrt{1-a^x}$ and
$x \mapsto 1-\sqrt{ax}$ are convex. The convexity of these functions and Lemmas 10.5
and 10.6 imply Lemma 10.4.
10.9. Proofs of Theorems 8.3 and 8.4. We consider a $(m, w, d_{II})$-hypercube
with

$m = \lfloor\log_2|\mathcal G|\rfloor,$

$h_1 = -B$ and $h_2 = B$, and with $w$ and $d_{II}$ to be taken in order to (almost)
maximize the bound.


Case 9 = 1. Computations lead to 



di = \h2-hi 



1 



P+ 



A 



/i2 — /ill = B\/dii 



so that, choosing $w = 1/m$, (8.14) gives

$\sup_{P\in\mathcal H}\bigl\{\mathbb E R(\hat g) - \min_g R(g)\bigr\} \ge B\sqrt{d_{II}}\bigl(1-\sqrt{n\,d_{II}/m}\bigr).$

Maximizing the lower bound w.r.t. $d_{II}$, we choose $d_{II} = \frac{m}{4n}\wedge1$ and obtain
the announced result.
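
The choice of $d_{II}$ can be confirmed by a brute-force search. This sketch assumes the lower bound $B\sqrt{d_{II}}\bigl(1-\sqrt{n\,d_{II}/m}\bigr)$ on $d_{II}\in[0;1]$ and compares its value at $\frac{m}{4n}\wedge1$ with the maximum over a fine grid; the function names and the default $B = 1$ are illustrative.

```python
import math


def l1_lower_bound(d2: float, m: int, n: int, big_b: float = 1.0) -> float:
    """B * sqrt(d2) * (1 - sqrt(n * d2 / m)), the bound obtained with w = 1/m."""
    return big_b * math.sqrt(d2) * (1.0 - math.sqrt(n * d2 / m))


def best_d2(m: int, n: int) -> float:
    """Closed-form maximizer over d2 in [0, 1]: min(m / (4n), 1)."""
    return min(m / (4.0 * n), 1.0)


if __name__ == "__main__":
    m, n = 10, 100
    grid_max = max(l1_lower_bound(k / 10_000, m, n) for k in range(10_001))
    print(l1_lower_bound(best_d2(m, n), m, n), grid_max)
```

When $m \le 4n$ the optimal value is $\frac B4\sqrt{m/n}$, which is of the same order as the upper bound for $q = 1$.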



Case 1<(?<1 + Y^A1. Tedious computations put in Appendix A.l 
lead to: for any p £ [0; 1] , 



(10.19) 
and 

(10.20) 



4>{p) 



\h2 - hi 



4>"{p) 



q-1 



[p(l-p)](2-9)/{9-l) 



h2-hi\i 



\pl/{q-l) + (1 __p)l/(9-l)]g+l ■ 

From (8.11), for any < e < 1, we get 

,///!- Vdu 



di>^ [tA(l-t)] 
2 J{l-e)/2 



dt 



> *1£(^ inf \4>"iu)\ 

2 4 uel(l-eVdri)/2;(l+eVdri)/2] 



e(2-e) , 



e(2-e) , q 
> '-du X — 



1 _ e^du^ (2-g)/{g-i) 



(2S)5 



29+i[(l + e^/dir)/2](9+i)/(9-i) 
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8 5 — 1 

= (1 - e/2)qB'i^{l - eVd^,f-'''>/^''-'\l + eVdrif-"'^^'^''-'\ 
q-l 

Let = (1 - e\/dI7)^^~''^/^''~^Hl + ^VS)^^"^''^/^''"^^- Prom (8.14), taking 
w = 1/m, we get 

(10.21) sup (KR{g) - mmR{g)\ > (1 - e/2)KqB'i^^{l - Jndu/m). 
p&n I 9 ) q-l ^ 

This leads us to choose dn = ^ A 1 and e = (gr — 1) V | < ^ and obtain 

-«)-T-(»)^T-{G#)^('-\/D}- 

Since 1 <q<2 and e\/(In = we may check that K > 0.29 (to be com- 
pared with hmq_>i K = k, 0.37). 

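Case $q > 1 + \sqrt{\frac{\lfloor\log_2|\mathcal G|\rfloor}{4n}}\wedge1$. We take $w = \frac{1}{n+1}\wedge\frac{1}{\lfloor\log_2|\mathcal G|\rfloor}$. From (8.4), (8.6) and (10.19),
we get $d_I = \psi_{1,0,-B,B}(1/2) = \phi_{-B,B}(1/2) = B^q$. From (8.15), we obtain

(10.22) $\mathbb E R(\hat g) - \min_{g\in\mathcal G}R(g) \ge \Bigl(\frac{\lfloor\log_2|\mathcal G|\rfloor}{n+1}\wedge1\Bigr)B^q\Bigl(1-\frac{1}{n+1}\wedge\frac{1}{\lfloor\log_2|\mathcal G|\rfloor}\Bigr)^n \ge e^{-1}B^q\Bigl(\frac{\lfloor\log_2|\mathcal G|\rfloor}{n+1}\wedge1\Bigr),$

where the last inequality uses $[1-1/(n+1)]^n \searrow e^{-1}$.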



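Improvement when $1 + \sqrt{\frac{\tilde m}{4n} \wedge 1} \le q \le 2$. From (10.21), by choosing $\epsilon = 1/2$ 
and introducing $K' = \big(1 - \sqrt{\tilde d_{II}}/2\big)^{(2-q)/(q-1)} \big(1 + \sqrt{\tilde d_{II}}/2\big)^{(1-2q)/(q-1)}$, we 
obtain 

$$\sup_{P \in \mathcal{H}} \Big\{ \mathbb{E} R(\hat g) - \min_{g} R(g) \Big\} \ge \frac{3}{8} \, K' \, \frac{q}{q-1} \, B^q \, \tilde d_{II} \, \Big(1 - \sqrt{n \tilde d_{II}/\tilde m}\Big).$$

This leads us to choose $\tilde d_{II} = \frac{\tilde m}{4n} \wedge 1$. Since $\sqrt{\frac{\tilde m}{4n} \wedge 1} \le q - 1$, we have $\sqrt{\tilde d_{II}} \le q - 1$, 
hence $K' \ge \big(1 - \tfrac{1}{2}(q-1)\big)^{(2-q)/(q-1)} \big(1 + \tfrac{1}{2}(q-1)\big)^{(1-2q)/(q-1)}$. For 
any $1 < q \le 2$, this last quantity is greater than 0.2. So we have proved that 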



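for $1 + \sqrt{\frac{\tilde m}{4n} \wedge 1} \le q \le 2$, 

$$\mathbb{E} R(\hat g) - \min_{g \in \mathcal{G}} R(g) \ge \frac{q}{90(q-1)} \, B^q \, \frac{\lfloor \log_2 |\mathcal{G}| \rfloor}{n}. \tag{10.23}$$

Theorem 8.4 follows from (10.22) and (10.23). 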
10.10. Proof of Theorem 8.6. 



10.10.1. Proof of the first inequality of Theorem 8.6. Let $\tilde m = \lfloor \log_2 |\mathcal{G}| \rfloor$. 
Contrary to other lower bounds obtained in this work, this learning setting 
requires asymmetrical hypercubes of distributions. Here we consider a 
constant $\tilde m$-dimensional hypercube of distributions with edge probability $\tilde w$ 
such that $p_+ = p$, $p_- = 0$, $h_1 = +B$ and $h_2 = 0$, where $\tilde w$, $p$ and $B$ are positive 
real parameters to be chosen according to the strategy described at the 
beginning of Section 8.3. To have $\mathbb{E}|Y|^s \le A$, we need that $\tilde m \tilde w p B^s \le A$. To 
ensure that a best prediction function has infinity norm bounded by $b$, from 
the computations at the beginning of Appendix A.1, we need that 

$$\frac{B \, p^{1/(q-1)}}{p^{1/(q-1)} + (1-p)^{1/(q-1)}} \le b.$$

This inequality is in particular satisfied for $B = C p^{-1/(q-1)}$ for an appropriate 
small constant $C$ depending on $b$ and $q$. From the definition of the edge 
discrepancy of type II, we have $\tilde d_{II} = p$. In order to have the r.h.s. of (8.14) 
of order $\tilde m \tilde w d_I$, we want to have $n \tilde w p \le C < 1$. All the previous constraints 
lead us to take the parameters $\tilde w$, $p$ and $B$ such that 

$$\begin{cases} \tilde m \tilde w p B^s = A, \\ n \tilde w p = 1/4. \end{cases}$$

Let $Q = \frac{\tilde m}{n} \wedge 1$. Combined with $B = C p^{-1/(q-1)}$, this leads to $p = C Q^{(q-1)/s}$, $B = C Q^{-1/s}$ and 
$\tilde w = C \tilde m^{-1} Q^{1 - (q-1)/s}$ with $C$ small positive constants depending on $b$, $A$, $q$ and 
$s$. Now from the definition of the edge discrepancy of type I and (8.10), we 
have 

$$\begin{aligned}
d_I &= \frac{p^2}{2} \int_0^1 [t \wedge (1-t)] \, |\phi''(tp)| \, dt \\
&\ge \frac{p^2}{2} \int_{1/4}^{3/4} \frac{1}{4} \, dt \, \inf_{u \in [p/4;\, 3p/4]} |\phi''(u)| \\
&\ge C p^2 \, p^{(2-q)/(q-1)} B^q,
\end{aligned}$$

where the last inequality comes from (10.20). From (8.14), we get 

$$\sup_{P \in \mathcal{H}} \Big\{ \mathbb{E} R(\hat g) - \min_{g} R(g) \Big\} \ge C Q^{1 - (q-1)/s}.$$
10.10.2. Proof of the second inequality of Theorem 8.6. We still use $\tilde m = 
\lfloor \log_2 |\mathcal{G}| \rfloor$. We consider a $(\tilde m, \tilde w, \tilde d_{II})$-hypercube with $h_1 = -B$ and $h_2 = +B$, 
where $\tilde w$, $\tilde d_{II}$ and $B$ are positive real parameters to be chosen according to the 
strategy described at the beginning of Section 8.3. To have $\mathbb{E}|Y|^s \le A$, we 
need that $\tilde m \tilde w B^s \le A$. To ensure that a best prediction function has infinity 
norm bounded by $b$, from the computations at the beginning of Appendix 
A.1, we need that 

$$B \le b \, \frac{\big[1 + \tilde d_{II}^{1/2}\big]^{1/(q-1)} + \big[1 - \tilde d_{II}^{1/2}\big]^{1/(q-1)}}{\big[1 + \tilde d_{II}^{1/2}\big]^{1/(q-1)} - \big[1 - \tilde d_{II}^{1/2}\big]^{1/(q-1)}}.$$

For fixed $q$ and $b$, this inequality essentially means that $B \le C \tilde d_{II}^{-1/2}$ since 
we intend to take $\tilde d_{II}$ close to 0. In order to have the r.h.s. of (8.14) of order 
$\tilde m \tilde w d_I$, we want to have $n \tilde w \tilde d_{II} \le 1/4$ where, once more, this last constant is 
arbitrarily taken. The previous constraints lead us to choose 

$$\begin{cases} B = C \tilde d_{II}^{-1/2}, \\ \tilde m \tilde w B^s = A, \\ n \tilde w \tilde d_{II} = 1/4. \end{cases}$$

We still use $Q = \frac{\tilde m}{n} \wedge 1$. This leads to $\tilde d_{II} = C Q^{2/(s+2)}$, $B = C Q^{-1/(s+2)}$ and 
$\tilde w = C \tilde m^{-1} Q^{s/(s+2)}$ with $C$ small positive constants depending on $b$, $A$, $q$ 
and $s$. Now from (10.20), we have $|\phi''(t)| \ge C B^q = C Q^{-q/(s+2)}$. 
Using (8.11) and (8.14), we obtain 

$$\sup_{P \in \mathcal{H}} \Big\{ \mathbb{E} R(\hat g) - \min_{g} R(g) \Big\} \ge C Q^{1 - q/(s+2)}.$$
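
The way the three constraints of Section 10.10.2 determine the parameters can be made explicit. The sketch below is an illustration under stated assumptions: every unspecified constant $C$ is set to 1, the regime is $m \le n$ so that $Q = m/n$, and the function name `solve_constraints` is hypothetical. It solves $B = C d^{-1/2}$, $m w B^s = A$ and $n w d = 1/4$ in closed form and shows that $d$ scales as $Q^{2/(s+2)}$.

```python
def solve_constraints(n, m, s, A, C=1.0):
    """Solve B = C * d**-0.5, m * w * B**s = A, n * w * d = 1/4 for (d, B, w).

    Eliminating w and B gives C**s * d**(-s/2 - 1) = 4 * A * n / m.
    C = 1 is an arbitrary illustrative value for the unspecified constant.
    """
    d = (C ** s * m / (4.0 * A * n)) ** (2.0 / (s + 2.0))
    B = C * d ** -0.5
    w = 1.0 / (4.0 * n * d)
    return d, B, w

# Hypothetical sizes with m <= n, so that Q = m / n.
s, A, m = 2.0, 1.0, 10
for n in (1000, 4000):
    d, B, w = solve_constraints(n, m, s, A)
    Q = m / n
    # The ratio d / Q**(2/(s+2)) does not depend on n.
    print(n, d, B, w, d / Q ** (2.0 / (s + 2.0)))
```

The same elimination applied to $B$ and $w$ gives the exponents $-1/(s+2)$ and $s/(s+2)$ stated in the text.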
APPENDIX 

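A.1. Computations of the second derivative of $\phi$ for the $L^q$-loss. Let 
$h_1$ and $h_2$ be fixed. We start with the computation of $\phi$. For any $p \in [0; 1]$, 
the quantity $\varphi_p(y) = p|y - h_1|^q + (1-p)|y - h_2|^q$ is minimized when $y \in 
[h_1 \wedge h_2; h_1 \vee h_2]$ and $pq(y - h_1)^{q-1} = (1-p)q(h_2 - y)^{q-1}$. Introducing $r = \frac{1}{q-1}$ 
and $D = p^r + (1-p)^r$, the minimizer can be written as $y = \frac{p^r h_1 + (1-p)^r h_2}{D}$ and 
the minimum is 

$$\phi(p) = \Big[ p \, \frac{(1-p)^{rq}}{D^q} + (1-p) \, \frac{p^{rq}}{D^q} \Big] |h_2 - h_1|^q = p(1-p) \, |h_2 - h_1|^q \, D^{1-q},$$

where we use the equality $rq = 1 + r$. We get 

$$\begin{aligned}
\frac{1}{|h_2 - h_1|^q} \, \phi'(p) &= (1 - 2p) D^{1-q} + p(1-p)(1-q) \, r D^{-q} \big[p^{r-1} - (1-p)^{r-1}\big] \\
&= D^{-q} \big\{ (1 - 2p)\big[p^r + (1-p)^r\big] - (1-p) p^r + p (1-p)^r \big\} \\
&= D^{-q} \big[ (1-p)^{r+1} - p^{r+1} \big],
\end{aligned}$$

hence 

$$\begin{aligned}
\frac{1}{|h_2 - h_1|^q} \, \phi''(p) &= -qr D^{-q-1} \big[p^{r-1} - (1-p)^{r-1}\big]\big[(1-p)^{r+1} - p^{r+1}\big] - (r+1) D^{1-q} \\
&= -qr D^{-q-1} \Big\{ \big[p^{r-1} - (1-p)^{r-1}\big]\big[(1-p)^{r+1} - p^{r+1}\big] + \big[p^r + (1-p)^r\big]^2 \Big\} \\
&= -qr D^{-q-1} [p(1-p)]^{r-1} \\
&= -\frac{q}{q-1} \, \frac{[p(1-p)]^{(2-q)/(q-1)}}{\big[p^{1/(q-1)} + (1-p)^{1/(q-1)}\big]^{q+1}}.
\end{aligned}$$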
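
The closed forms of this appendix can be cross-checked numerically. The following sketch is an illustration, not part of the original computation. It assumes the closed forms $\phi(p) = p(1-p)|h_2 - h_1|^q D^{1-q}$ and $\phi''(p) = -qr|h_2 - h_1|^q [p(1-p)]^{r-1} D^{-q-1}$ with $r = 1/(q-1)$ and $D = p^r + (1-p)^r$, and compares them with a direct minimization over $y$ and with a central finite difference. All function names, the step size and the test values are hypothetical.

```python
def phi_closed(p, q, h1, h2):
    """Closed form p(1-p) |h2-h1|^q D^(1-q), with r = 1/(q-1), D = p^r + (1-p)^r."""
    r = 1.0 / (q - 1.0)
    D = p ** r + (1.0 - p) ** r
    return abs(h2 - h1) ** q * p * (1.0 - p) * D ** (1.0 - q)

def phi_numeric(p, q, h1, h2, iters=200):
    """min over y of p|y-h1|^q + (1-p)|y-h2|^q, by ternary search on [h1 ^ h2, h1 v h2]."""
    def f(y):
        return p * abs(y - h1) ** q + (1.0 - p) * abs(y - h2) ** q
    lo, hi = min(h1, h2), max(h1, h2)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return f((lo + hi) / 2.0)

def phi_second_closed(p, q, h1, h2):
    """Closed form -q r |h2-h1|^q [p(1-p)]^(r-1) D^(-q-1)."""
    r = 1.0 / (q - 1.0)
    D = p ** r + (1.0 - p) ** r
    return -q * r * abs(h2 - h1) ** q * (p * (1.0 - p)) ** (r - 1.0) * D ** (-q - 1.0)

def phi_second_fd(p, q, h1, h2, eps=1e-4):
    """Central finite-difference approximation of the second derivative of phi_closed."""
    return (phi_closed(p + eps, q, h1, h2) - 2.0 * phi_closed(p, q, h1, h2)
            + phi_closed(p - eps, q, h1, h2)) / eps ** 2

print(phi_closed(0.3, 1.5, -1.0, 1.0), phi_numeric(0.3, 1.5, -1.0, 1.0))
print(phi_second_closed(0.3, 1.5, -1.0, 1.0), phi_second_fd(0.3, 1.5, -1.0, 1.0))
```

With $h_1 = -B$, $h_2 = B$ and $p = 1/2$, the closed form reduces to $B^q$, in agreement with the value of $\phi_{-B,B}(1/2)$ used in Section 10.9.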

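A.2. Expected risk bound from Hoeffding's inequality. Let $\lambda' > 0$ and $\rho$ 
be a probability distribution on $\mathcal{G}$. Let $r(g)$ denote the empirical risk of a 
prediction function $g$, that is, $r(g) = \frac{1}{n} \sum_{i=1}^n L(Z_i, g)$. Hoeffding's inequality 
applied to the random variable $W = \mathbb{E}_{g \sim \rho} L(Z, g) - L(Z, g') \in [-(b-a); b-a]$ 
for a fixed $g'$ gives 

$$\mathbb{E}_Z \, e^{\eta [W - \mathbb{E}_Z W]} \le e^{\eta^2 (b-a)^2 / 2}$$

for any $\eta > 0$. For $\eta = \lambda'/n$, this leads to 

$$\mathbb{E}_{Z_1^n} \, e^{\lambda' [R(g') - \mathbb{E}_{g \sim \rho} R(g) - r(g') + \mathbb{E}_{g \sim \rho} r(g)]} \le e^{(\lambda')^2 (b-a)^2 / (2n)}.$$

Consider the Gibbs distribution $\hat\rho = \pi_{-\lambda' r}$. This distribution satisfies 

$$\mathbb{E}_{g' \sim \hat\rho} \, r(g') + \frac{K(\hat\rho, \pi)}{\lambda'} \le \mathbb{E}_{g \sim \rho} \, r(g) + \frac{K(\rho, \pi)}{\lambda'}.$$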
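
The minimizing property of the Gibbs distribution can be illustrated on a finite reference set. The sketch below is a numerical check under illustrative assumptions: a uniform prior on four hypothetical prediction functions, made-up empirical risks, and an arbitrary value of $\lambda'$. It verifies that the Gibbs distribution, with weights proportional to $\pi(g) e^{-\lambda' r(g)}$, attains the value $-\frac{1}{\lambda'} \log \mathbb{E}_{g \sim \pi} e^{-\lambda' r(g)}$ of the penalized empirical risk, and that randomly drawn distributions do no better. All names are hypothetical.

```python
import math
import random

def gibbs(prior, risks, lam):
    """Gibbs distribution with weights proportional to prior(g) * exp(-lam * r(g))."""
    weights = [p * math.exp(-lam * r) for p, r in zip(prior, risks)]
    total = sum(weights)
    return [w / total for w in weights]

def kl(rho, prior):
    """Kullback-Leibler divergence K(rho, prior) on a finite set."""
    return sum(a * math.log(a / b) for a, b in zip(rho, prior) if a > 0)

def objective(rho, prior, risks, lam):
    """E_{g ~ rho} r(g) + K(rho, prior) / lam."""
    return sum(a * r for a, r in zip(rho, risks)) + kl(rho, prior) / lam

def free_energy(prior, risks, lam):
    """-(1/lam) * log E_{g ~ prior} exp(-lam * r(g))."""
    return -math.log(sum(p * math.exp(-lam * r) for p, r in zip(prior, risks))) / lam

# Made-up data: uniform prior on four functions, arbitrary empirical risks.
prior = [0.25, 0.25, 0.25, 0.25]
risks = [0.1, 0.4, 0.35, 0.8]
lam = 5.0

rho_hat = gibbs(prior, risks, lam)
best = objective(rho_hat, prior, risks, lam)

rng = random.Random(0)
worst_gap = float("inf")
for _ in range(1000):
    raw = [rng.random() for _ in prior]
    rho = [x / sum(raw) for x in raw]
    worst_gap = min(worst_gap, objective(rho, prior, risks, lam) - best)

print(best, free_energy(prior, risks, lam), worst_gap)
```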

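We have 

$$\begin{aligned}
\mathbb{E}_{Z_1^n} \mathbb{E}_{g' \sim \hat\rho} R(g') - \mathbb{E}_{g \sim \rho} R(g) 
&\le \mathbb{E}_{Z_1^n} \Big\{ \mathbb{E}_{g' \sim \hat\rho} \big[ R(g') - \mathbb{E}_{g \sim \rho} R(g) - r(g') + \mathbb{E}_{g \sim \rho} r(g) \big] + \frac{K(\rho, \pi) - K(\hat\rho, \pi)}{\lambda'} \Big\} \\
&\le \frac{K(\rho, \pi)}{\lambda'} + \frac{1}{\lambda'} \log \mathbb{E}_{g' \sim \pi} \mathbb{E}_{Z_1^n} \, e^{\lambda' [R(g') - \mathbb{E}_{g \sim \rho} R(g) - r(g') + \mathbb{E}_{g \sim \rho} r(g)]} \\
&\le \frac{K(\rho, \pi)}{\lambda'} + \frac{\lambda' (b-a)^2}{2n}.
\end{aligned}$$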
This proves that for any $\lambda > 0$, the generalization error of the algorithm 
which draws its prediction function according to the Gibbs distribution 
$\pi_{-\lambda \Sigma_n / 2}$ satisfies 

$$\mathbb{E}_{Z_1^n} \mathbb{E}_{g' \sim \pi_{-\lambda \Sigma_n / 2}} R(g') \le \min_{\rho} \Big\{ \mathbb{E}_{g \sim \rho} R(g) + 2 \Big[ \frac{\lambda (b-a)^2}{8} + \frac{K(\rho, \pi)}{\lambda n} \Big] \Big\},$$

where we use the change of variable $\lambda = 2\lambda'/n$ in order to underline the 
difference with (6.4). 